Abstract


In complex sequential decision problems such as scheduling factory production, planning
medical treatments, and playing backgammon, optimal decision policies are in
general  unknown,  and  it  is  often  difficult,  even  for  human  domain  experts,  to  manually 
encode  good  decision  policies  in  software.  The  reinforcement-learning  methodology 
of  “value  function  approximation”  (VFA)  offers  an  alternative:  systems  can  learn 
effective  decision  policies  autonomously,  simply  by  simulating  the  task  and  keeping 
statistics  on  which  decisions  lead  to  good  ultimate  performance  and  which  do  not. 
This  thesis  advances  the  state  of  the  art  in  VFA  in  two  ways. 

First, it presents three new VFA algorithms, which apply to three different restricted
classes of sequential decision problems: Grow-Support for deterministic problems,
ROUT for acyclic stochastic problems, and Least-Squares TD(λ) for fixed-policy
prediction problems. Each is designed to gain robustness and efficiency over current
approaches by exploiting the restricted problem structure to which it applies.

Second,  it  introduces  STAGE,  a  new  search  algorithm  for  general  combinatorial 
optimization  tasks.  STAGE  learns  a  problem-specific  heuristic  evaluation  function  as 
it searches. The heuristic is trained by supervised linear regression or Least-Squares
TD(λ) to predict, from features of states along the search trajectory, how well a fast
local  search  method  such  as  hillclimbing  will  perform  starting  from  each  state.  Search 
proceeds  by  alternating  between  two  stages:  performing  the  fast  search  to  gather  new 
training  data,  and  following  the  learned  heuristic  to  identify  a  promising  new  start 
state. 

STAGE  has  produced  good  results  (in  some  cases,  the  best  results  known)  on 
a  variety  of  combinatorial  optimization  domains,  including  VLSI  channel  routing, 
Bayes net structure-finding, bin-packing, Boolean satisfiability, radiotherapy treatment
planning, and geographic cartogram design. This thesis describes the results
in detail, analyzes the reasons for and conditions of STAGE's success, and places
STAGE in the context of four decades of research in local search and evaluation function
learning. It provides strong evidence that reinforcement learning methods can
be efficient and effective on large-scale decision problems.

Chapter 1

Introduction 


In the industrial age, humans delegated physical labor to machines. Now, in the
information age, we are increasingly delegating mental labor, charging computers
with such tasks as controlling traffic signals, scheduling factory production, planning
medical treatments, allocating investment portfolios, routing data through communications
networks, and even playing expert-level backgammon or chess. Such tasks
are  difficult  sequential  decision  problems: 

•  the  task  calls  not  for  a  single  decision,  but  rather  for  a  whole  series  of  decisions 
over  time; 

• the outcome of any decision may depend on random environmental factors beyond
the computer's control; and

• the ultimate objective, measured in terms of traffic flow, patient health, business
profit, or game victory, depends in a complicated way on many interacting
decisions and their random outcomes.

In  such  complex  problems,  optimal  decision  policies  are  in  general  unknown,  and  it  is 
often  difficult,  even  for  human  domain  experts,  to  manually  encode  even  reasonably 
good  decision  policies  in  software.  A  growing  body  of  research  in  Artificial  Intelligence 
suggests  the  following  alternative  methodology: 

A  decision-making  algorithm  can  autonomously  learn  effective 
policies  for  sequential  decision  tasks,  simply  by  simulating  the 
task  and  keeping  statistics  on  which  decisions  tend  to  lead  to 
good  ultimate  performance  and  which  do  not. 

The field of reinforcement learning, to which this thesis contributes, defines a principled
foundation for this methodology.

1.1  Motivation:  Learning  Evaluation  Functions 

In  Artificial  Intelligence,  the  fundamental  data  structure  for  decision-making  in  large 
state spaces is the evaluation function. Which state should be visited next in the
search for a better, nearer, cheaper goal state? The evaluation function maps features
of each state to a real value that assesses the state's promise. For example, in the domain
of chess, a classic evaluation function is obtained by summing material advantage
weighted by 1 for pawns, 3 for bishops and knights, 5 for rooks, and 9 for queens. The
choice of evaluation function “critically determines search results” [Nilsson 80, p.74]
in popular algorithms for planning and control (A*), game-playing (alpha-beta), and
combinatorial optimization (hillclimbing, simulated annealing).
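
As a concrete illustration of an evaluation function, the material-advantage rule quoted above can be sketched in a few lines of Python. Only the weights come from the text; the piece-count dictionaries and the function name are hypothetical, chosen for the example, and a real chess program would use a far richer board representation.

```python
# Material weights quoted in the text: pawn 1, knight 3, bishop 3, rook 5, queen 9.
PIECE_WEIGHTS = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}


def material_evaluation(white_counts, black_counts):
    """Return White's weighted material minus Black's (hypothetical helper)."""
    def weighted_total(counts):
        return sum(PIECE_WEIGHTS[piece] * n for piece, n in counts.items())
    return weighted_total(white_counts) - weighted_total(black_counts)


# Illustrative position: Black is missing one pawn and one bishop.
white = {"P": 8, "N": 2, "B": 2, "R": 2, "Q": 1}
black = {"P": 7, "N": 2, "B": 1, "R": 2, "Q": 1}
score = material_evaluation(white, black)
```

Here the features of a state are the piece counts and the evaluation is linear in them, which is the kind of form whose coefficients the rest of this thesis aims to learn rather than hand-tune.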

Evaluation  functions  have  generally  been  designed  by  human  domain  experts. 
The weights {1, 3, 3, 5, 9} in the chess evaluation function given above summarize the
judgment of generations of chess players. IBM's Deep Blue chess computer, which
defeated world champion Garry Kasparov in a 1997 match, used an evaluation function
of over 8000 tunable parameters, the values of which were set initially by an
automatic procedure, but later carefully hand-tuned under the guidance of a human
grandmaster [Hsu et al. 90, Campbell 98]. Similar tuning occurs in combinatorial
optimization domains such as the Traveling Salesperson Problem [Lin and Kernighan
73] and VLSI circuit design tasks [Wong et al. 88]. In such domains the state space
consists of legal candidate solutions, and the domain's objective function (the function
that evaluates the quality of a final solution) can itself serve as an evaluation
function to guide search. However, if the objective function has many local optima
or  regions  of  constant  value  (plateaus)  with  respect  to  the  available  search  moves, 
then  it  will  not  be  effective  as  an  evaluation  function.  Thus,  to  get  good  optimization 
results,  engineers  often  spend  considerable  effort  tweaking  the  coefficients  of  penalty 
terms  and  other  additions  to  their  objective  function;  I  cite  several  examples  of  this 
in  Chapter  3.  Clearly,  automatic  methods  for  building  evaluation  functions  offer 
the  potential  both  to  save  human  effort  and  to  optimize  search  performance  more 
effectively. 

1.2  The  Promise  of  Reinforcement  Learning 

Reinforcement learning (RL) provides a solid foundation for learning evaluation functions
for sequential decision problems. Standard RL methods assume that the problem
can be formalized as a Markov decision process (MDP), a model of controllable
dynamic systems used widely in control theory, artificial intelligence, and operations
research [Puterman 94]. I describe the MDP model in detail in Chapter 2. The key
fact about this model is that for any MDP, there exists a special evaluation function
known as the optimal value function. Denoted by V*(x), the optimal value function
predicts the expected long-term reward available from each state x when all future
decisions are made optimally. V* is an ideal evaluation function; a greedy one-step
lookahead search based on V* identifies precisely the optimal long-term decision to
make at each state. The problem, then, becomes how to compute V*.

Algorithms for computing V* are well understood in the case where the MDP state
space is relatively small (say, fewer than 10^ discrete states), so that V* can be implemented
as a lookup table. In small MDPs, if we have access to the transition model,
which tells us the distribution of successor states that will result from applying a given
action in a given state, then V* may be calculated exactly by a variety of classical algorithms
such as dynamic programming or linear programming [Puterman 94]. In small
MDPs where the explicit transition model is not available, we must build V* from
sample trajectories generated by direct interaction with a simulation of the process; in
this case, recently discovered reinforcement learning methods such as TD(λ) [Sutton
88], Q-learning [Watkins 89], and Prioritized Sweeping [Moore and Atkeson 93] apply.
These algorithms apply dynamic programming in an asynchronous, incremental way,
but under suitable conditions can still be shown to converge to V* [Bertsekas and
Tsitsiklis 96, Littman and Szepesvari 96].

The situation is very different for large-scale decision tasks, such as the transportation
and medical domains mentioned at the start of this chapter. These tasks
have high-dimensional state spaces, so enumerating V* in a table is intractable, a
problem known as the “curse of dimensionality” [Bellman 57]. One approach to escaping
this curse is to approximate V* compactly using a function approximator such
as linear regression or a neural network. The combination of reinforcement learning
and function approximation, known as neuro-dynamic programming [Bertsekas and
Tsitsiklis 96] or value function approximation [Boyan et al. 95], has produced several
notable successes on such problems as backgammon [Tesauro 92, Boyan 92], job-shop
scheduling [Zhang and Dietterich 95], and elevator control [Crites and Barto 96].
However, these implementations are extremely computationally intensive, requiring
many thousands or even millions of simulated trajectories to reach top performance.
Furthermore, when general function approximators are used instead of lookup tables,
the convergence proofs for nearly all dynamic programming and reinforcement
learning algorithms fail to carry through [Boyan and Moore 95, Bertsekas 95, Baird
95, Gordon 95]. Perhaps the strongest convergence result for value function approximation
to date is the following [Tsitsiklis and Roy 96]: for an MDP with a fixed
decision-making policy, the TD(λ) algorithm may be used to calculate an accurate
linear approximation to the value function. Though its assumption of a fixed policy
is quite limiting, this theorem nonetheless applies to the learning done by STAGE, a
practical algorithm for global optimization introduced in this dissertation.

1.3 Outline of the Dissertation

This  thesis  aims  to  advance  the  state  of  the  art  in  value  function  approximation  for 
large,  practical  sequential  decision  tasks.  It  addresses  two  questions: 

1.  Can  we  devise  new  methods  for  value  function  approximation  that  are  robust 
and  efficient? 

2. Can we apply the currently available convergence results to practical problems?

Both questions are answered in the affirmative:

1. I discuss three new algorithms for value function approximation, which apply
to three different restricted classes of Markov decision processes: Grow-Support
for large deterministic MDPs (§2.2), ROUT for large acyclic MDPs (§2.3), and
Least-Squares TD(λ) for large Markov chains (§6.1). Each is designed to gain
robustness and efficiency by exploiting the restricted MDP structure to which
it applies.

2. I introduce STAGE, a new reinforcement learning algorithm designed specifically
for large-scale global optimization tasks. In STAGE, commonly applied
local optimization algorithms such as stochastic hillclimbing are viewed as inducing
fixed decision policies on an MDP. Given that view, TD(λ) or supervised
learning may be applied to learn an approximate value function for the policy.
STAGE then exploits the learned value function to improve optimization performance
in real time.

The  thesis  is  organized  as  follows: 

Chapter  2  presents  formal  definitions  and  notation  for  Markov  decision  processes 
and  value  function  approximation.  It  then  summarizes  Grow-Support  and 
ROUT,  algorithms  which  learn  to  approximate  V*  in  deterministic  and  acyclic 
MDPs,  respectively.  Both  these  algorithms  build  V*  strictly  backward  from  the 
goal,  even  when  given  only  a  forward  simulation  model,  as  is  usually  the  case. 
These algorithms have been presented previously [Boyan and Moore 95, Boyan
and Moore 96], but this chapter offers a new unified discussion of both algorithms
and new results and analysis for ROUT.

Chapter  3  introduces  STAGE,  the  algorithm  which  is  the  main  contribution  of  this 
dissertation [Boyan and Moore 97, Boyan and Moore 98]. STAGE is a practical
method for applying value function approximation to arbitrary large-scale global
optimization problems. This chapter motivates and describes the algorithm and
discusses issues of theoretical soundness and computational efficiency.

Chapter 4 presents empirical results with STAGE on seven large-scale optimization
domains: bin-packing, channel routing, Bayes net structure-finding, radiotherapy
treatment planning, cartogram design, Boolean formula satisfiability, and
Boggle board setup. The results show that on a wide range of problems, STAGE
learns efficiently, effectively, and with minimal need for problem-specific parameter
tuning.

Chapter 5 analyzes STAGE's success, giving evidence that reinforcement learning
is indeed responsible for the observed improvements in performance. The sensitivity
of the algorithm to various user choices, such as the feature representation
and function approximator, and to various algorithmic choices, such as when to
end a trial and how to begin a new one, is tested empirically.

Chapter 6 offers two significant investigations beyond the basic STAGE algorithm.
In Section 6.1, I describe a least-squares implementation of TD(λ), which generalizes
both standard supervised linear regression and earlier results on least-squares
TD(0) [Bradtke and Barto 96]. In Section 6.2, I discuss ways of transferring
knowledge learned by STAGE from already-solved instances to novel
similar instances, with the goal of saving training time.

Chapter 7 reviews the relevant work from the optimization and AI literatures, situating
STAGE at the confluence of adaptive multi-start local search methods,
reinforcement learning methods, genetic algorithms, and evaluation function
learning techniques for game-playing and problem-solving search.

Chapter  8  concludes  with  a  summary  of  the  thesis  contributions  and  a  discussion 
of  the  many  directions  for  future  research  in  value  function  approximation  for 
optimization. 




Chapter  2 

Learning Evaluation Functions for Sequential Decision Making


Given only a simulator for a complex task and a measure of overall cumulative performance,
how can we efficiently build an evaluation function which enables optimal
or  near-optimal  decisions  to  be  made  at  every  choice  point?  This  chapter  discusses 
approaches  based  on  the  formalism  of  Markov  decision  processes  and  value  functions. 
After  introducing  the  notation  which  will  be  used  throughout  this  dissertation,  I  give 
a  review  of  the  literature  on  value  function  approximation.  I  then  discuss  two  original 
approaches,  Grow-Support  and  ROUT,  for  approximating  value  functions  robustly  in 
certain  restricted  problem  classes. 


2.1  Value  Function  Approximation  (VFA) 

The  optimal  value  function  is  an  evaluation  function  which  encapsulates  complete 
knowledge  of  the  best  expected  search  outcome  attainable  from  each  state: 

V*(x) = the expected long-term reward starting from x, assuming optimal decisions.    (2.1)

Such  an  evaluation  function  is  ideal  in  that  a  greedy  local  search  with  respect  to  V* 
will  always  make  the  globally  optimal  move.  In  this  section,  I  formalize  the  above 
definition, review the literature on computing V*, and motivate a new class of
approximation algorithms for this problem based on working backwards.

2.1.1  Markov  Decision  Processes 

Formally,  let  our  search  space  be  represented  as  a  Markov  decision  process  (MDP), 
defined  by 

• a finite set of states X, including a set of start states S ⊆ X;

• a finite set of actions A;

• a reward function R : X × A → ℜ, where R(x, a) is the expected immediate
reward (or negative cost) for taking action a in state x; and

• a transition model P : X × X × A → ℜ, where P(x'|x, a) gives the probability
that executing action a in state x will lead to state x'.

An agent in an MDP environment observes its current state x_t, selects an action
a_t, and as a result receives a reward r_t and moves probabilistically to another state
x_{t+1}. It is assumed that the agent can fully observe its current state at all times;
more general partially observable MDP models [Littman 96] are beyond the scope of
this dissertation. The basic MDP model is flexible enough to represent AI planning
problems, stochastic games (e.g., backgammon) against a fixed opponent, and combinatorial
optimization search spaces. With natural extensions, it can also represent
continuous  stochastic  control  domains,  two-player  games,  and  many  other  problem 
formulations  [Littman  94,  Harmon  et  al.  95,  Littman  and  Szepesvari  96,  Mahadevan 
et  al.  97]. 

Decisions in an MDP are represented by a policy π : X → A, which maps each
state to a chosen action (or, more generally, a probability distribution over actions).
I assume that the policy is stationary, that is, unchanging over the course of a simulation.
For any stationary policy π, the policy value function V^π(x) is defined as the
expected long-term reward accumulated by starting from state x and following policy
π thereafter:

\[ V^{\pi}(x) = E\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(x_t, \pi(x_t)) \;\middle|\; x_0 = x \right] \tag{2.2} \]

Here, γ ∈ [0, 1] is a discount factor which determines the extent of our preference
for short-term rewards over long-term rewards. Assuming bounded rewards, V^π is
certainly well-defined for any choice of γ < 1; in the undiscounted case of γ = 1,
V^π remains well-defined under the additional condition that every trajectory is
guaranteed to terminate, i.e., reach a special absorbing state for which all further
rewards are 0. Most of the problems considered in this dissertation have this property
naturally; furthermore, an arbitrary MDP evaluated with a discount factor γ < 1 may
be transformed into an absorbing MDP whose undiscounted returns are equivalent
to the original problem's discounted returns, simply by introducing a termination
probability of 1 − γ at each state. Therefore, I will generally assume γ = 1 throughout
this dissertation, giving equal weight to short-term and long-term rewards.
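
The equivalence between discounting and random termination can be checked in one line. The sketch below assumes (this ordering is my assumption, not stated in the text) that the reward for step t is collected before that step's termination draw, and that termination occurs independently of the state trajectory with probability 1 − γ per step, so the process is still running at step t with probability γ^t:

```latex
\[
E\left[\, \sum_{t=0}^{\infty} \mathbf{1}\{\text{not terminated before step } t\}\; R(x_t, \pi(x_t)) \;\middle|\; x_0 = x \right]
= \sum_{t=0}^{\infty} \gamma^{t}\, E\left[\, R(x_t, \pi(x_t)) \mid x_0 = x \right]
= V^{\pi}(x).
\]
```

The left-hand side is the expected undiscounted return of the transformed absorbing MDP, and the right-hand side is the discounted value of Equation 2.2 in the original MDP.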

The policy value function V^π satisfies this linear system of Bellman equations for
prediction:

\[ \forall x,\quad V^{\pi}(x) = R(x, \pi(x)) + \gamma \sum_{x' \in X} P(x' \mid x, \pi(x))\, V^{\pi}(x') \tag{2.3} \]
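
To make the Bellman equations for prediction concrete, here is a minimal sketch that solves them for a tiny absorbing chain by repeatedly applying the right-hand side as an update until the values stop changing. The three-state chain, its rewards, and the function name are hypothetical; the iteration is one simple way to solve the linear system, not a method prescribed by the text.

```python
def evaluate_policy(rewards, transitions, gamma=1.0, tol=1e-12, max_sweeps=10_000):
    """Iterate the prediction equations until the largest change in a sweep is < tol.

    rewards[x]     -- R(x, pi(x)) under the fixed policy pi
    transitions[x] -- dict {x_next: P(x_next | x, pi(x))}; empty for absorbing states
    """
    values = {x: 0.0 for x in rewards}
    for _ in range(max_sweeps):
        largest_change = 0.0
        for x in rewards:
            if not transitions[x]:  # absorbing state: all further rewards are 0
                continue
            backed_up = rewards[x] + gamma * sum(
                p * values[x_next] for x_next, p in transitions[x].items())
            largest_change = max(largest_change, abs(backed_up - values[x]))
            values[x] = backed_up
        if largest_change < tol:
            break
    return values


# Hypothetical chain: every step costs 1; s0 repeats itself half the time.
rewards = {"s0": -1.0, "s1": -1.0, "end": 0.0}
transitions = {"s0": {"s0": 0.5, "s1": 0.5}, "s1": {"end": 1.0}, "end": {}}
values = evaluate_policy(rewards, transitions)
```

Solving the two equations by hand gives V^π(s1) = −1 and V^π(s0) = −1 + 0.5·V^π(s0) + 0.5·(−1), so V^π(s0) = −3, which the iteration approaches.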
The solution to an MDP is an optimal policy π* which simultaneously maximizes
V^π(x) at every state x. A deterministic optimal policy exists for every MDP [Bellman
57]. The policy value function of π* is the optimal value function V* of Equation 2.1.
It satisfies the Bellman equations for control:

\[ \forall x,\quad V^{*}(x) = \max_{a \in A} \left[ R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, V^{*}(x') \right] \tag{2.4} \]


From  the  value  function  V*,  it  is  easy  to  recover  the  optimal  policy:  at  any  state  x, 
any  action  which  instantiates  the  max  in  Equation  2.4  is  an  optimal  choice  [Bellman 
57].  This  formalizes  the  notion  that  V*  is  an  ideal  evaluation  function. 
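
The following sketch illustrates both steps on a toy problem: value iteration applies the right-hand side of Equation 2.4 as an update rule, and the optimal policy is then read off by picking the maximizing action at each state. The three-state model, its action names, and the function names are hypothetical and serve only as an example.

```python
def q_value(model, values, x, a, gamma):
    """One-step lookahead: R(x, a) + gamma * sum over x' of P(x'|x, a) * V(x')."""
    reward, successors = model[x][a]
    return reward + gamma * sum(p * values[x2] for x2, p in successors.items())


def value_iteration(model, gamma=1.0, tol=1e-12, max_sweeps=10_000):
    """Repeat synchronous Bellman backups until the values stop changing."""
    values = {x: 0.0 for x in model}
    for _ in range(max_sweeps):
        new_values = {
            x: max((q_value(model, values, x, a, gamma) for a in model[x]),
                   default=0.0)  # states with no actions are absorbing
            for x in model
        }
        change = max(abs(new_values[x] - values[x]) for x in model)
        values = new_values
        if change < tol:
            break
    return values


def greedy_policy(model, values, gamma=1.0):
    """Pick, at each non-absorbing state, an action attaining the max."""
    return {x: max(model[x], key=lambda a: q_value(model, values, x, a, gamma))
            for x in model if model[x]}


# model[x][a] = (expected reward, {successor: probability}); all names hypothetical.
model = {
    "far": {"walk": (-1.0, {"near": 1.0}),
            "jump": (-2.0, {"goal": 0.5, "far": 0.5})},
    "near": {"walk": (-1.0, {"goal": 1.0})},
    "goal": {},
}
values = value_iteration(model)
policy = greedy_policy(model, values)
```

In this example walking twice costs 2 in total, while jumping costs 2 per attempt and fails half the time, so the greedy policy with respect to the computed values chooses "walk" at both states.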

Algorithms  for  computing  V*  are  well  understood  in  the  case  where  the  MDP 
state  space  is  relatively  small  (say,  fewer  than  10^  discrete  states),  so  that  V*  can  be 
implemented  as  a  lookup  table.  In  small  MDPs,  if  we  have  explicit  knowledge  of  the 
transition model P(x'|x, a) and reward function R(x, a), then V* may be calculated
exactly  by  a  variety  of  classical  algorithms  such  as  linear  programming  [D’Epenoux 
63],  policy  iteration  [Howard  60],  modified  policy  iteration  [Puterman  and  Shin  78], 
or  value  iteration  [Bellman  57].  In  small  MDPs  where  the  transition  model  is  not 
explicitly  available,  we  must  build  V*  from  sample  trajectories  generated  by  direct 
interaction  with  a  simulation  of  the  process;  in  this  case,  reinforcement  learning  (RL) 
methods apply. RL methods are either model-based, which means they build an
empirical transition model from the sample trajectories and then apply one of the
aforementioned classical algorithms (e.g., Dyna-Q [Sutton 90], Prioritized Sweeping
[Moore and Atkeson 93]), or model-free, which means they estimate V* values directly
(e.g., TD(λ) [Sutton 88], Q-learning [Watkins 89], SARSA [Rummery and Niranjan
94, Singh and Sutton 96]). I will have more to say on the issue of model-based versus
model-free algorithms in Section 6.1.2. Broadly speaking, all these algorithms may
be  viewed  as  applying  dynamic  programming  in  an  asynchronous,  incremental  way; 
and  under  suitable  conditions,  all  can  still  be  shown  to  converge  to  the  exact  optimal 
value  function  [Bertsekas  and  Tsitsiklis  96,  Singh  et  al.  98]. 

The  situation  is  very  different  for  practical  large-scale  decision  tasks.  These  tasks 
have  high-dimensional  state  spaces,  so  enumerating  V*  in  a  table  is  intractable — a 
problem  known  as  the  “curse  of  dimensionality”  [Bellman  57].  Computing  V*  requires 
generalization. One natural approach is to encode the states as real-valued feature
vectors and to use a function approximator to fit V* over the feature space. This
approach goes by the name value function approximation (VFA) [Boyan et al. 95].

2.1.2 VFA Literature Review

The  current  state  of  the  art  in  value  function  approximation  is  surveyed  thoroughly  in 
the  book  Neuro-Dynamic  Programming  [Bertsekas  and  Tsitsiklis  96].  Here,  I  briefly 
review  the  history  of  the  field  and  the  state  of  the  art,  so  as  to  place  this  chapter’s 
algorithms  in  context. 

Any  review  of  the  literature  on  reinforcement  learning  and  evaluation  functions 
must  begin  with  the  pioneering  work  of  Arthur  Samuel  on  the  game  of  checkers 
[Samuel  59,  Samuel  67].  Samuel  implicitly  recognized  the  worth  of  the  value  function, 
saying  that 

...  we  are  attempting  to  make  the  score,  calculated  for  the  current  board 
position,  look  like  that  calculated  for  the  terminal  board  position  of  the 
chain  of  moves  which  most  probably  will  occur  during  actual  play.  Of 
course,  if  one  could  develop  a  perfect  system  of  this  sort  it  would  be 
the  equivalent  of  always  looking  ahead  to  the  end  of  the  game.  [Samuel 
59,  p.  219] 

Samuel's program incrementally changed the coefficients of an evaluation polynomial
so  as  to  make  each  visited  state’s  value  closer  to  the  value  obtained  from  lookahead 
search. 

In the dynamic-programming community, Bellman [63] and others explored polynomial
and spline fits for value function approximation in continuous MDPs; reviews
of these efforts may be found in [Johnson et al. 93, Rust 96]. But Artificial Intelligence
research into evaluation function learning was sporadic until the 1980s. In
the  domain  of  chess,  Christensen  [86]  tried  replacing  Samuel’s  coefficient-tweaking 
procedure  with  least-squares  regression,  and  was  able  to  learn  reasonable  weights  for 
the  chess  material-advantage  function.  In  Othello,  Lee  and  Mahajan  [88]  trained  a 
nonlinear  evaluation  function  on  expertly  played  games,  and  it  played  at  a  high  level. 
Christensen  and  Korf  [86]  put  forth  a  unified  theory  of  heuristic  evaluation  functions, 
advocating the principles of “outcome determination” and “move invariance”; these
correspond precisely to the two key properties of MDP value functions, that they represent
long-term predictions and that they satisfy the Bellman equations. Finally, in
the late 1980s, the reinforcement learning community elaborated the deep connection
between AI search and dynamic programming [Barto et al. 89, Watkins 89, Sutton
90, Barto et al. 95]. This connection had been unexplored despite the publication of
an  AI  textbook  by  Richard  Bellman  himself  [Bellman  78]. 

Reinforcement learning's most celebrated success has also been in a game domain:
backgammon [Tesauro 92, Tesauro 94]. Tesauro modified Sutton's TD(λ) algorithm
[Sutton 88], which is designed to approximate V^π for a fixed policy π, to the task
of learning an optimal value function V* and optimal policy. The modification is
simple: instead of generating sample trajectories by simulating a fixed policy π,
generate sample trajectories by simulating the policy μ which is greedy with respect
to the current value function approximation Ṽ:

\[ \mu(x) = \arg\max_{a \in A} \left[ R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, \tilde{V}(x') \right] \tag{2.5} \]

(Occasional non-greedy “exploration” moves are also usually performed [Thrun 92,
Singh et al. 98], but were found unnecessary in backgammon because of the domain's
inherent stochasticity [Tesauro 92].) The modified algorithm has been termed optimistic
TD(λ) [Bertsekas and Tsitsiklis 96], because little is known of its convergence
properties. An implementation is sketched in Table 2.1. When λ = 0, the algorithm
strongly resembles Real-Time Dynamic Programming (RTDP) [Barto et al. 95], except
that RTDP assigns target values at each state by a “full backup” (averaging over
all possible outcomes, as in value iteration) rather than TD(0)'s “sample backups”
(learning from only the single observed outcome). Applying optimistic TD(λ) with
a multi-layer perceptron function approximator, Tesauro's program learned an evaluation
function which produced expert-level backgammon play. These results have
been replicated by myself [Boyan 92] and others.

Tesauro's combination of optimistic TD(λ) and neural networks has been applied
to other domains, including elevator control [Crites and Barto 96] and job-shop
scheduling [Zhang and Dietterich 95]. (I will discuss the scheduling application in
detail in Section 7.2.) Nevertheless, it is important to note that when function approximators
are used, optimistic TD(λ) provides no guarantees of optimality. The
following paragraphs summarize the current convergence results for value function
approximation. For both the prediction learning (approximating V^π) and control
learning (approximating V*) tasks, the relevant questions are (1) do the available
algorithms converge, and (2) if so, how good are their resulting approximations?

We first consider the case of approximating the policy value function V^π of a fixed
policy π. The TD(λ) family of algorithms applies here. When λ = 1, TD(λ) reduces to
performing stochastic gradient descent to minimize the squared difference between the
approximated predictions and the observed simulation outcomes. Under standard
conditions, using any parametric function approximator, this will converge to a local
optimum of the squared-error function. For sufficiently small λ, however, TD(λ) may
diverge when nonlinear function approximators are used [Bertsekas and Tsitsiklis 96].
Only in the case where the function approximator is a linear architecture over state
features has TD(λ) been proven to converge for arbitrary λ [Tsitsiklis and Roy 96].
Optimistic TD(λ) for value function approximation:

Given:

• a simulation model for MDP X;

• a function approximator Ṽ(x) parametrized by weight vector w;

• a sequence of step sizes α_1, α_2, ... for incremental weight updating; and

• a parameter λ ∈ [0, 1].

Output: a weight vector w such that Ṽ(x) ≈ V*(x).


Set w := 0 (or an arbitrary initial estimate).
for n := 1, 2, ... do: {

1. Using the greedy policy for the current evaluation function Ṽ (see Eq. 2.5),
generate a trajectory from a start state in X to termination:
x_0 → x_1 → ... → x_T. Record the rewards r_0, r_1, ..., r_T received at each step.

2. Update the weights of Ṽ from the trajectory as follows:
for i := T downto 0, do: {

targ_i := r_T (the terminal reward)                               if i = T
targ_i := r_i + λ · targ_{i+1} + (1 − λ) · Ṽ(x_{i+1})             otherwise.

Update Ṽ's weights by delta rule: w := w + α_n (targ_i − Ṽ(x_i)) ∇_w Ṽ(x_i).
}
}

Table 2.1. Optimistic TD(λ) for undiscounted value function approximation in an
absorbing MDP. This easy-to-implement version performs updates after the termination
of each trajectory. For an incremental version that performs updates after each
transition, refer to [Sutton 87]. In practice, trajectories are often generated using a
mixture of greedy moves and exploration moves [Thrun 92, Singh et al. 98].
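
As an illustration only, the weight update in step 2 of Table 2.1 might be sketched in Python as follows. The sketch assumes a linear approximator, so that the gradient of the prediction with respect to w is simply the feature vector; the function name and the list-based data layout are hypothetical and are not taken from the thesis.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def td_lambda_sweep(w, features, rewards, lam, alpha):
    """One backward pass of the Table 2.1 update over a finished trajectory.

    features[i] is the feature vector of state x_i (i = 0..T) and rewards[i]
    is r_i; rewards[T] is the terminal reward. A linear approximator
    V(x) = w . phi(x) is assumed, so its gradient is phi(x).
    """
    w = list(w)
    T = len(features) - 1
    targ = 0.0
    for i in range(T, -1, -1):
        if i == T:
            targ = rewards[T]
        else:
            # Blend the later target with the current prediction at x_{i+1}.
            v_next = dot(w, features[i + 1])
            targ = rewards[i] + lam * targ + (1.0 - lam) * v_next
        error = targ - dot(w, features[i])
        # Delta rule: move the weights along the gradient, scaled by the error.
        w = [wj + alpha * error * fj for wj, fj in zip(w, features[i])]
    return w


# Three states with one-hot features: x_0 -> x_1 -> x_2 (terminal).
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_td1 = td_lambda_sweep([0.0, 0.0, 0.0], one_hot, [-1.0, -1.0, 0.0], lam=1.0, alpha=1.0)
w_td0 = td_lambda_sweep([0.0, 0.0, 0.0], one_hot, [-1.0, -1.0, 0.0], lam=0.0, alpha=1.0)
```

With one-hot features and a step size of 1, both settings of λ recover the exact returns on this tiny chain; the two differ only when the approximator generalizes across states or the step size is small.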
A useful error bound has also been shown in the linear case: the resulting fit is worse
than the best possible linear fit by a factor of at most (1 − γλ)/(1 − γ), assuming a
discount factor of γ < 1 [Tsitsiklis and Roy 96]. This implies that TD(1) is guaranteed
to produce the best fit, but the bound quickly deteriorates as λ decreases. The same
qualitative conclusion applies (though the formula for the bound is more complex)
for γ = 1 [Bertsekas and Tsitsiklis 96].
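
To get a feel for how quickly the bound loosens, the factor can simply be tabulated. The helper below is a hypothetical convenience that evaluates the expression (1 − γλ)/(1 − γ) for a discount factor γ < 1 and nothing more.

```python
def td_error_factor(gamma, lam):
    """Worst-case ratio (1 - gamma*lam) / (1 - gamma) between the TD(lambda)
    fit error and the best linear fit error, for a discount gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("the bound is stated for 0 <= gamma < 1")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    return (1.0 - gamma * lam) / (1.0 - gamma)


# The factor is 1 at lambda = 1 and grows to 1 / (1 - gamma) at lambda = 0.
factors = {lam: td_error_factor(0.5, lam) for lam in (1.0, 0.5, 0.0)}
```

For γ = 0.5 the factor rises from 1 at λ = 1 to 2 at λ = 0; for discount factors close to 1, the value at λ = 0 is 1/(1 − γ) and therefore large.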

We now proceed to the harder problem of approximating the optimal value function
V*. First, independent of how we construct it, is an approximate value function
Ṽ useful for deriving a decision-making policy? Singh and Yee [94] show that if Ṽ
differs from V* by at most ε at any state, then the expected return of the greedy
policy for Ṽ will be worse than that of the optimal policy by at most
2γε/(1 − γ). A similar result holds in the undiscounted case, assuming all policies are
proper (γ is then replaced by a contraction factor in a suitably weighted max norm).
This bound is not particularly comforting, since 1/(1 − γ) will be large in practical
applications, but at least it guarantees that policies cannot be arbitrarily bad.

How should we construct an approximation to V*? In general, algorithms based on
value iteration’s one-step-backup operator, such as optimistic TD(λ), use function
approximator predictions to assign new training values for that same function
approximator: a recursive process that may propagate and enlarge approximation errors,
leading to parameter divergence. I have demonstrated empirically that such divergence
can indeed happen when offline value iteration is combined with commonly used function
approximators, such as polynomial regression and neural networks [Boyan and Moore 95].
Small illustrative examples of divergence have also been demonstrated [Baird 95,
Gordon 95]. Sutton has argued that certain of these instabilities may be prevented by
sampling states along simulated trajectories, as optimistic TD(λ) does [Sutton 96];
but there are no convergence proofs of this as yet.

Parameter divergence in offline value iteration can provably be prevented by using
function approximators belonging to the class of averagers, such as k-nearest-neighbor
[Gordon 95]. However, this class excludes practical function approximators
which extrapolate trends beyond their training data (e.g., global or local polynomial
regression, neural networks). Residual algorithms, which attempt to blend optimistic
TD(λ) with a direct minimization of the residual approximation errors in the
Bellman equation, are guaranteed stable with arbitrary parametric function
approximators [Baird 95]; these methods are promising but as yet unproven on real-world
problems.
2.1.3 Working Backwards

Value iteration (VI) computes V* by repeatedly sweeping over the state space,
applying Equation 2.4 as an assignment statement (this is called a “one-step backup”)
at each state in parallel. Suppose the lookup table is initialized with all 0’s. Then
after the i-th sweep of VI, the table will store the maximum expected return of a path
of length i from each state. For so-called stochastic shortest path problems in which
every trajectory produced by the optimal policy inevitably terminates in an absorbing
state [Bertsekas and Tsitsiklis 96], this corresponds to the intuition that VI works by
propagating correct V* values backwards, by one step per iteration, from the terminal
states.
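
The backward propagation can be seen on a toy problem. The sketch below runs synchronous sweeps over a lookup table; the dictionary layout `mdp[state][action] = [(probability, reward, next_state), ...]` is an assumption made for this example, and terminal states are represented by an empty action dictionary.

```python
def value_iteration_sweeps(mdp, sweeps):
    """Run a fixed number of synchronous one-step-backup sweeps.

    mdp[s][a] is a list of (probability, reward, next_state) outcomes;
    terminal states map to an empty dict and keep the value 0.
    """
    value = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        # Build a fresh table so every backup in a sweep reads the old values.
        value = {
            s: max(
                (sum(p * (r + value[n]) for p, r, n in outcomes)
                 for outcomes in actions.values()),
                default=0.0,
            )
            for s, actions in mdp.items()
        }
    return value


# Deterministic chain 3 -> 2 -> 1 -> 0 (terminal), with reward -1 per step.
chain = {0: {}}
for s in (1, 2, 3):
    chain[s] = {"step": [(1.0, -1.0, s - 1)]}

after_one = value_iteration_sweeps(chain, 1)
after_three = value_iteration_sweeps(chain, 3)
```

After one sweep only the state adjacent to the terminal state holds its correct value; each further sweep makes one more state correct, which is the one-step-per-iteration behavior described above.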

I  have  explored  the  efficiency  and  robustness  gains  possible  when  VI  is  modified 
to  take  advantage  of  the  working-backwards  intuition.  There  are  two  main  classes 
of  MDPs  for  which  correct  V*  values  can  be  assigned  by  working  strictly  backwards 
from  terminal  states: 

1.  deterministic  domains  with  no  positive-reward  cycles  and  with  every  state  able 
to  reach  at  least  one  terminal  state.  This  class  includes  shortest-path  and 
minimum  “cost-to-go”  problems  [Bertsekas  and  Tsitsiklis  96]. 

2.  (possibly  stochastic)  acyclic  domains:  domains  where  no  trajectory  can  pass 
through  the  same  state  twice.  Many  problems  naturally  have  this  property  (e.g., 
board-filling  games  like  tic-tac-toe  and  Connect-Four,  industrial  scheduling  as 
described  in  Section  2.3.4  below,  and  any  finite-horizon  problem  for  which  time 
is  a  component  of  the  state). 

Using VI to solve MDPs belonging to either of these special classes can be quite
inefficient, since VI performs backups over the entire space, whereas the only backups
useful for improving V* are those on the “frontier” between already-correct and
not-yet-correct V* values. In fact, for small problems there are classical algorithms
for both problem classes which compute V* more efficiently by explicitly working
backwards: for the deterministic class, Dijkstra’s shortest-path algorithm; and for the
acyclic class, Directed-Acyclic-Graph-Shortest-Paths (DAG-SP) [Cormen et
al. 90].^ DAG-SP first topologically sorts the MDP, producing a linear ordering of
the states in which every state x precedes all states reachable from x. Then, it runs
through that list in reverse, performing one backup per state. Worst-case bounds for
VI, Dijkstra, and DAG-SP in deterministic domains with X states and A actions per
state are O(AX²), O(AX log X), and O(AX), respectively.

^Although Cormen et al. [90] present DAG-SP only for deterministic acyclic problems, it applies
straightforwardly to the stochastic case.
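
A minimal sketch of DAG-SP for a small, explicitly enumerated acyclic MDP is given below. It relies on Python's standard `graphlib` module for the topological sort; the dictionary layout of `mdp` and the example chain are assumptions made for illustration, not part of the original presentation.

```python
from graphlib import TopologicalSorter


def dag_sp(mdp):
    """Compute V* for an acyclic MDP with one backup per state.

    mdp[s][a] is a list of (probability, reward, next_state) outcomes.
    States with no actions (or absent from mdp) are terminal with value 0.
    """
    successors = {
        s: {n for outcomes in actions.values() for _, _, n in outcomes}
        for s, actions in mdp.items()
    }
    # TopologicalSorter emits each node after the nodes listed for it, so
    # passing successor sets yields the reversed ordering: terminals first.
    value = {}
    for s in TopologicalSorter(successors).static_order():
        actions = mdp.get(s, {})
        if not actions:
            value[s] = 0.0
            continue
        value[s] = max(
            sum(p * (r + value[n]) for p, r, n in outcomes)
            for outcomes in actions.values()
        )
    return value


# A stochastic chain: from state n, hop to n-1 (reward -2) or n-2 (reward -4)
# with equal probability; state 1 can only hop to the terminal state 0.
hop_chain = {0: {}, 1: {"hop": [(1.0, -2.0, 0)]}}
for n in range(2, 5):
    hop_chain[n] = {"hop": [(0.5, -2.0, n - 1), (0.5, -4.0, n - 2)]}

values = dag_sp(hop_chain)
```

Each state is backed up exactly once, after all of its successors, so no value ever needs to be revised.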
Another difference between VI and working backwards is that VI repeatedly
re-estimates the values at every state, using old predictions to generate new training
values. By contrast, Dijkstra and DAG-SP are always explicitly aware of which states
have their V* values already known, and can hold those values fixed. This distinction
is important in the context of generalization and the possibility of approximation
error.

In  sum,  I  have  presented  two  reasons  why  working  strictly  backwards  may  be 
desirable:  efficiency,  because  updates  need  only  be  done  on  the  “frontier”  rather 
than  all  over  state  space;  and  robustness,  because  correct  V*  values,  once  assigned, 
need  never  again  be  changed.  I  have  therefore  investigated  generalizations  of  the 
Dijkstra  and  DAG-SP  algorithms  specifically  modified  to  accommodate  huge  state 
spaces  and  value  function  approximation.  My  variant  of  Dijkstra’s  algorithm,  called 
Grow-Support,  was  presented  in  [Boyan  and  Moore  95]  and  is  summarized  briefly 
in  Section  2.2.  My  variant  of  DAG-SP  is  an  algorithm  called  ROUT  [Boyan  and 
Moore  96],  which  I  describe  in  more  detail  and  with  new  results  in  Section  2.3.  Other 
researchers  have  also  investigated  learning  control  by  working  backwards,  notably 
Atkeson  [94]  for  the  case  of  deterministic  domains  with  continuous  dynamics. 

2.2  VFA  in  Deterministic  Domains:  “Grow-Support” 

This section summarizes Grow-Support, an algorithm for value function approximation
in large, deterministic, minimum-cost-to-goal domains [Boyan and Moore 95].
Grow-Support  is  designed  to  construct  the  optimal  value  function  with  a  generalizing 
function  approximator  while  remaining  robust  and  stable.  It  recognizes  that  function 
approximators  cannot  always  be  relied  upon  to  fit  the  intermediate  value  functions 
produced  by  value  iteration.  Instead,  it  assumes  only  that  the  function  approximator 
can  represent  the  final  V*  function  accurately,  if  given  accurate  training  values  for  a 
prespecified  collection  of  sample  states.  The  specific  principles  of  Grow-Support  are 
as  follows: 

1.  We  maintain  a  “support”  subset  of  sample  states  whose  final  V*  values  have 
been  computed,  starting  with  terminal  states  and  then  growing  backward  from 
there.  The  fitter  V  is  trained  only  on  these  values,  which  we  assume  it  is  capable 
of  fitting. 

2.  Instead of propagating values by one-step backups, we use rollouts: simulated
trajectories guided by the current greedy policy on V. They explicitly verify the
achievability of a state’s estimated future reward before that state is added to
the support set. In a rollout, the new V training value is derived from rewards
along an actual path to the goal, not from the predictions made by the previous
iteration’s function approximation. This prevents divergence.

3.  We  take  maximum  advantage  of  generalization.  On  each  iteration,  we  add  to 
the  support  set  any  sample  state  that  can,  by  executing  a  single  action,  reach  a 
state  that  passes  the  rollout  test.  In  a  discrete  environment,  this  would  cause  the 
support  set  to  expand  in  one-step  concentric  “shells”  back  from  the  goal,  similar 
to  Dijkstra’s  algorithm.  But  in  the  continuous  case,  the  function  approximator 
may  be  able  to  extrapolate  correctly  well  beyond  the  support  region — and  when 
this  happens,  we  can  add  many  points  to  the  support  set  at  once.  This  leads  to 
the  very  desirable  behavior  that  the  support  set  grows  in  big  jumps  in  regions 
where  the  value  function  is  smooth. 


Grow-Support(X, G, A, NextState, Cost, Ṽ):

Given:  •  a finite collection of states X = {x_1, x_2, ...} sampled from the
           continuous state space 𝒳 ⊂ ℝ^n, and goal region G ⊂ X
        •  a finite set of allowable actions A
        •  a deterministic transition function NextState : X × A → X
        •  the 1-step cost function Cost : X × A → ℝ
        •  a smoothing function approximator Ṽ
        •  a tolerance level ε for value function approximation error

Support := {(x_i ↦ 0) | x_i ∈ G}
repeat
    Train Ṽ to approximate the training set Support
    for each x_i ∉ Support do
        c := min_{a ∈ A} [Cost(x_i, a) + RolloutCost(NextState(x_i, a), Ṽ)]
        if c < ∞ then
            add (x_i ↦ c) to the training set Support
until Support stops growing or includes all sample points.

RolloutCost(state x, fitter Ṽ):
    Starting from x, simulate the greedy policy defined by value function Ṽ until
    either reaching the goal, or exceeding a total path cost of Ṽ(x) + ε.
    Then return:
        → the actual total cost of the path, if goal is reached with cost < Ṽ(x) + ε;
        → ∞, if goal is not reached in cost Ṽ(x) + ε.


Table 2.2. The Grow-Support algorithm and RolloutCost subroutine
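
The pseudocode of Table 2.2 might be rendered in Python roughly as below. This is a sketch under stated assumptions rather than the implementation used in the experiments: the "smoothing function approximator" is replaced by a one-dimensional least-squares line (`fit_line`), states are integers on a line, and all function and parameter names are hypothetical.

```python
import math


def fit_line(support):
    """Least-squares line through the (state, value) pairs in support."""
    n = len(support)
    mean_x = sum(support) / n
    mean_v = sum(support.values()) / n
    sxx = sum((x - mean_x) ** 2 for x in support)
    sxv = sum((x - mean_x) * (v - mean_v) for x, v in support.items())
    slope = sxv / sxx if sxx > 0 else 0.0
    return lambda x: mean_v + slope * (x - mean_x)


def rollout_cost(x, v, goal, actions, next_state, cost, eps):
    """Follow the greedy policy for v from x; return the path cost actually
    incurred, or infinity once it exceeds the budget v(x) + eps.
    Assumes every action has strictly positive cost, so the loop terminates."""
    budget = v(x) + eps
    total = 0.0
    while x not in goal:
        a = min(actions, key=lambda b: cost(x, b) + v(next_state(x, b)))
        total += cost(x, a)
        x = next_state(x, a)
        if total > budget:
            return math.inf
    return total


def grow_support(samples, goal, actions, next_state, cost, eps):
    """Grow the support set backward from the goal, retraining each round."""
    support = {x: 0.0 for x in samples if x in goal}
    sizes = [len(support)]
    while len(support) < len(samples):
        v = fit_line(support)
        added = {}
        for x in samples:
            if x in support:
                continue
            c = min(
                cost(x, a) + rollout_cost(next_state(x, a), v, goal,
                                          actions, next_state, cost, eps)
                for a in actions
            )
            if c < math.inf:
                added[x] = c
        if not added:          # support stopped growing
            break
        support.update(added)
        sizes.append(len(support))
    return support, sizes


# States 0..10 on a line, goal at 0; moving left by 1 or 2 costs 1 or 2.
support, sizes = grow_support(
    samples=range(11), goal={0}, actions=(-1, -2),
    next_state=lambda x, a: max(0, x + a),
    cost=lambda x, a: -a, eps=0.5,
)
```

On this toy problem the first iteration can certify only the states within one action of the goal, because the initial fit is a constant. Once the line has been refit to those values it extrapolates correctly, and every remaining sample state passes the rollout test in the second iteration.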
The algorithm is sketched in Table 2.2. In a series of experiments reported in
[Boyan  and  Moore  95],  I  found  that  Grow-Support  is  more  robust  than  value  iteration 
with  function  approximation.  (Several  follow-up  studies  provide  additional  insight 
into  value  iteration’s  potential  for  divergence  [Gordon  95, Sutton  96].)  Grow-Support 
was  also  seen  to  be  no  more  computationally  expensive,  and  often  much  cheaper, 
despite  the  overhead  of  performing  rollouts.  Reasons  for  this  include:  (1)  the  rollout 
test  is  not  expensive;  (2)  once  a  state  has  been  added  to  the  support,  its  value  is 
fixed  and  it  needs  no  more  computation;  and  most  importantly,  (3)  the  aggressive 
exploitation  of  generalization  enables  the  algorithm  to  converge  in  very  few  iterations. 

It  is  easy  to  prove  that  Grow-Support  will  always  terminate  after  a  finite  number 
of  iterations.  If  the  function  approximator  is  inadequate  for  representing  the  V* 
function,  Grow-Support  may  terminate  before  adding  all  sample  states  to  the  support 
set.  When  this  happens,  we  then  know  exactly  which  of  the  sample  states  are  having 
trouble  and  which  have  been  learned.  This  suggests  potential  schemes  for  adaptively 
adding  sample  states  in  problematic  regions.  The  ROUT  algorithm,  described  next, 
does  adaptively  generate  its  own  set  of  sample  states  for  learning. 


2.3  VFA  in  Acyclic  Domains:  “ROUT” 

As Grow-Support scaled up Dijkstra’s algorithm for deterministic domains, ROUT
aims to scale up DAG-Shortest-Paths (DAG-SP) for stochastic, acyclic domains. In
large combinatorial spaces requiring function approximation, DAG-SP’s key
preprocessing step (topologically sorting the entire state space) is no longer
tractable. Instead, ROUT must expend some extra effort to identify states on the
current frontier. Once identified (as described below), a frontier state is assigned
its optimal V* value by a simple one-step backup, and this {state ↦ value} pair is
added to a training set for a function approximator. I determine the training value by
a one-step backup rather than rollouts because, unlike the deterministic MDPs to which
Grow-Support applies, stochastic MDPs would require performing not one but many
rollouts for accurate value determination. However, ROUT still does use an analogue of
Grow-Support’s “rollout test” to identify the states at which the one-step backup may
be safely applied.

In sum, ROUT’s main loop consists of identifying a frontier state; determining its
V* value; and retraining the approximator. The training set, constructed adaptively,
grows backwards from the goal. HuntFrontierState is the key subroutine ROUT
uses to identify a good state to add to the training set. The criteria for such a state
x are as follows:

1.  All states reachable from x should already have their V* values correctly
approximated by the function approximator. This ensures that the policy from x
onward is optimal, and that a correct target value for V*(x) can be assigned.

2.  x itself should not already have its V* value correctly approximated. This
condition aims to keep the training set as small as possible, by excluding states
whose values are correct anyway thanks to good generalization.

3.  x should be a state that we care to learn about. For that reason, ROUT
considers only states which occur on trajectories emanating from a starting
state of the MDP.

The HuntFrontierState operation returns a state which with high probability
satisfies these properties. It begins with some state x and generates a number of
trajectories from x, each time checking to see whether all states along the trajectory
are self-consistent (i.e., satisfy Equation 2.4 to some tolerance ε). If all states after
x on all sample trajectories are self-consistent, then x is deemed ready, and ROUT
will add x to its training set. If, on the other hand, a trajectory from x reveals any
inconsistencies in the approximated value function, then we flag that trajectory’s last
such inconsistent state, and restart HuntFrontierState from there. Table 2.3
specifies the algorithm, and Figure 2.1 illustrates how the routine works.

The parameters of the ROUT algorithm are H, the number of trajectories generated
to certify a state’s readiness, and ε, the tolerated Bellman residual. ROUT’s
convergence to the optimal V*, assuming the function approximator can fit the V*
training set perfectly, can be guaranteed in the limiting case where H → ∞ (assuring
exploration of all states reachable from x) and ε = 0. In practice, of course, we want
to be tolerant of some approximation error. Typical settings I used were H = 20 and
ε = roughly 5% of the range of V*.

The following sections present experimental results with ROUT on three domains:
a prediction task, a two-player dice game, and a k-armed bandit problem. For all
problems, I compare ROUT’s performance with that of optimistic TD(λ) given the
equivalent function approximator. I measure the time to reach best performance
(in terms of total number of state evaluations performed) and the quality of the
learned value function (in terms of Bellman residual, closeness to the true V*, and
performance of the greedy control policy). The results show that ROUT learned
evaluation functions which were as good as or better than those learned by TD(λ), and
used an order of magnitude less training data in doing so. I also report preliminary
results on a fourth domain, a simplified production scheduling task.
Figure 2.1. A schematic of ROUT working on an acyclic two-dimensional navigation
domain, where the allowable actions are only →, ↗, and ↑. Suppose that
ROUT has thus far established training values for V* at the triangles, and that the
function approximator has successfully generalized V* throughout the shaded region.
Now, when HuntFrontierState generates a trajectory from the start state to
termination (solid line), it finds that several states along that trajectory are
inconsistent (marked by crosses). The last such cross becomes the new starting point for
HuntFrontierState. From there, all trajectories generated (dashed lines) are fully
self-consistent, so that state gets added to ROUT’s training set. When the function
approximator is re-trained, the shaded region of validity should grow, backwards from
the goal.
ROUT(start states X, fitter Ṽ):

/* Assumes that the world model MDP is known and acyclic. */
initialize training set S := ∅, and Ṽ an arbitrary fit;
repeat:
    for each start state x ∈ X not yet marked “done”, do:
        s := HuntFrontierState(x, Ṽ);
        add {s ↦ one-step-backup(s)} to training set S and re-train fitter Ṽ on S;
        if (s = x), then mark start state x as “done”.
until all start states in X are marked “done”.

HuntFrontierState(state x, fitter Ṽ):

/* If the value function is self-consistent on all trajectories from x, return
   x. (That is determined probabilistically by Monte Carlo trials.) Otherwise,
   return a state on a trajectory from x for which the self-consistency
   property is true. */
for each legal action a ∈ A(x), do:
    repeat up to H times:
        generate a trajectory T from x to termination, starting with action a;
        let y be the last state on T with Bellman residual > ε;
        if (y ≠ ∅) and (y ≠ x), then break out of loops, and
            restart procedure with HuntFrontierState(y, Ṽ).

/* reaching this point, x’s subtree is deemed all self-consistent and correct! */
return x.


Table 2.3. The ROUT algorithm and the HuntFrontierState subroutine
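
The control flow of Table 2.3 can be sketched in Python as follows. Several simplifications are assumed here and are not part of the algorithm as specified: the fitter is replaced by an exact lookup table with default value 0 (so no generalization takes place), moves after the first action of each trajectory are chosen greedily, and the `model(state)` interface returning `{action: [(probability, reward, next_state), ...]}` is hypothetical.

```python
import random


def backup(s, value, model):
    """One-step backup at s: best expected reward plus successor value."""
    actions = model(s)
    if not actions:                      # terminal state
        return 0.0
    return max(sum(p * (r + value(n)) for p, r, n in outs)
               for outs in actions.values())


def sample_trajectory(x, first_action, value, model, rng):
    """States visited from x to termination: first_action, then greedy moves."""
    states, s, a = [x], x, first_action
    while model(s):
        u, acc = rng.random(), 0.0
        for p, _, n in model(s)[a]:      # sample one outcome of action a
            acc += p
            if u < acc:
                break
        s = n
        states.append(s)
        if model(s):
            a = max(model(s), key=lambda b: sum(
                p * (r + value(m)) for p, r, m in model(s)[b]))
    return states


def hunt_frontier_state(x, value, model, H, eps, rng):
    """Return x if H trajectories per action look self-consistent below x;
    otherwise restart from the last inconsistent state found."""
    while True:
        restart = None
        for a in model(x):
            for _ in range(H):
                traj = sample_trajectory(x, a, value, model, rng)
                bad = [s for s in traj
                       if abs(value(s) - backup(s, value, model)) > eps]
                if bad and bad[-1] != x:
                    restart = bad[-1]
                    break
            if restart is not None:
                break
        if restart is None:
            return x
        x = restart


def rout(start_states, model, H=20, eps=1e-9, seed=0):
    """ROUT main loop with a lookup table standing in for the fitter."""
    rng = random.Random(seed)
    table = {}
    value = lambda s: table.get(s, 0.0)
    done = set()
    while len(done) < len(start_states):
        for x in start_states:
            if x in done:
                continue
            s = hunt_frontier_state(x, value, model, H, eps, rng)
            table[s] = backup(s, value, model)   # "re-train" on the new pair
            if s == x:
                done.add(x)
    return table


def chain(n):
    """Deterministic acyclic chain: n -> n-1 with reward -2; 0 is terminal."""
    return {} if n == 0 else {"step": [(1.0, -2.0, n - 1)]}


training_set = rout([5], chain)
```

With a lookup table every non-terminal state on the way to the start state ends up in the training set, added in order of increasing distance from the terminal state. The savings reported for ROUT come from replacing the table with a generalizing fitter, so that most states already pass the consistency check and never need to be added.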
2.3.1 Task 1: Stochastic Path Length Prediction

The “Hopworld” is a small domain designed to illustrate how ROUT combines working
backwards, adaptive sampling and function approximation. The domain is an acyclic
Markov chain of 13 states in which each state has two equally probable successors:
one step to the right or two steps to the right. The transition rewards are such
that for each state V*(n) = -2n. Our function approximator Ṽ makes predictions
by interpolating between values at every fourth state. This is equivalent to using a
linear approximator over the four-element feature vector representation depicted in
Figure 2.2.


Figure  2.2.  The  Hopworld  Markov  chain.  Each  state  is  represented  by  a  four- 
element  feature  vector  as  shown.  The  function  approximator  is  linear. 

In ROUT, we fit the training set using a batch least-squares fit. In TD, the
coefficients are updated using the delta rule with a hand-tuned learning rate. The results
are shown in Table 2.4. ROUT’s performance is efficient and predictable on this
contrived problem. At the start, HuntFrontierState finds Ṽ is inconsistent and
trains Ṽ(1) and Ṽ(2) to be -2 and -4, respectively. Linear extrapolation then forces
states 3 and 4 to be correct. On the third iteration, Ṽ(5) is spotted as inconsistent
and added to the training set, and beneficial extrapolation continues. By comparison,
TD also has no trouble learning V*, but requires many more evaluations. This is
because TD trains blindly on all transitions, not only the useful ones; and because its
updates must be done with a fairly small learning rate, since the domain is stochastic.
TD could be improved by an adaptive learning rate, or better yet, by eliminating its
learning rates and performing Least-Squares TD as described later in Section 6.1.2.

2.3.2  Task  2:  A  Two-Player  Dice  Game 

“Pig”  is  a  two-player  children’s  dice  game.  Each  player  starts  with  a  total  score  of 
zero,  which  is  increased  on  each  turn  by  dice  rolling.  The  first  to  100  wins.  On  her 
turn,  a  player  accumulates  a  subtotal  by  repeatedly  rolling  a  6-sided  die.  If  at  any 
time  she  rolls  a  1,  however,  she  loses  the  subtotal  and  gets  only  1  added  to  her  total. 
Thus, before each roll, she must decide whether to (a) add her currently accumulated
subtotal  to  her  permanent  total  and  pass  the  turn  to  the  other  player;  or  (b)  continue 
rolling,  risking  an  unlucky  1. 

Pig  belongs  to  the  class  of  symmetric,  alternating  Markov  games.  This  means 
that  the  minimax-optimal  value  function  can  be  formulated  as  the  unique  solution 
to  a  system  of  generalized  Bellman  equations  [Littman  and  Szepesvari  96]  similar  to 
Equation  2.4.  The  state  space,  with  two-player  symmetry  factored  out,  has  515,000 
positions — large  enough  to  be  interesting,  but  small  enough  that  computing  the  exact 
V*  is  tractable. 

For input to the function approximator, we represent states by their natural
3-dimensional feature representation: X’s total, O’s total, and X’s current subtotal.
The approximator is a standard MLP with two hidden units. In ROUT, the network
is retrained to convergence (at most 1000 epochs) each time the training set is
augmented. Note that this extra cost of ROUT is not reflected in the results table, but
for practical applications, a far faster approximator than backpropagation would be
used with ROUT.^

The Pig results are charted in Table 2.4 and graphed in Figure 2.3. The graph
shows the learning curves for the best single trial of each of six classes of runs: TD(0),
TD(0.8) and TD(1), with and without exploration. (The vertical axis measures
performance in expected points per game against the minimax optimal player, where
+1 point is awarded for a win and -1 for a loss.) The best TD run, TD(0) with
exploration, required about 30 million evaluations to reach its best performance of
about -0.15. By contrast, ROUT completed successfully in under 1 million
evaluations, and performed at the significantly higher level of -0.09. ROUT’s adaptively
generated training set contained only 133 states.

2.3.3  Task  3:  Multi-armed  Bandit  Problem 

Our third test problem is to compute the optimal policy for a finite-horizon k-armed
bandit [Berry and Fristedt 85]. While an optimal solution in the infinite-horizon
case can be found efficiently using Gittins indices, solving the finite-horizon problem
is equivalent to solving a large acyclic, stochastic MDP in belief space [Berry and
Fristedt 85]. I show results for k = 3 arms and a horizon of n = 25 pulls, where
the resulting MDP has 736,281 states. Solving this MDP by DAG-SP produces the

^Unlike TD, which works only with parametric function approximators for which ∇_w Ṽ(x) can be
calculated, ROUT can work with arbitrary function approximators, including batch methods such
as projection-pursuit and locally weighted regression. For these comparative experiments, however,
we used linear or neural network fits for both algorithms.
Problem  Method               training    total      RMS       RMS        policy
                              samples     evals      Bellman   ||V*−Ṽ||   quality

HOP      Discrete*            12          21         0         0          -24 *
         ROUT                 4           158        0.        0.         -24
         TD(0)                5000        10,000     0.03      0.1        -24
         TD(1)                5000        10,000     0.03      0.1        -24

PIG      Discrete*            515,000     3.6M       0         0          0 *
         ROUT                 133         0.8M       0.09      0.14       -0.093
         TD(0) + explore      5 M         30 M       0.23      0.29       -0.151
         TD(0.8) + explore    9 M         60 M       0.23      0.33       -0.228
         TD(1) + explore      6 M         40 M       0.22      0.30       -0.264
         TD(0) no explore     8+ M        50+ M      0.12      0.54       -0.717
         TD(0.8) no explore   5 M         35 M       0.33      0.44       -0.308
         TD(1) no explore     5 M         30 M       0.23      0.32       -0.186

BAND     Discrete*            736,281     4 M        0         0          0.682 *
         ROUT                 30          15,850     0.01      0.05       0.668
         TD(0)                150,000     900,000    0.07      0.14       0.666
         TD(1)                100,000     600,000    0.02      0.04       0.669

Table 2.4. Summary of results. For each algorithm on each problem, I list two
measurements of time to quiescence followed by three measurements of the solution
quality. The measurements for TD were taken at the time when, roughly, best
performance was first consistently reached. (Key: M = 10^6; * denotes optimal performance
for each task.)


[Plot: performance vs. optimal opponent (vertical axis) against number of function
evaluations (horizontal axis).]


Figure 2.3. Performance of Pig policies learned by TD and ROUT. ROUT’s
performance is marked by a single diamond at the top left of the graph.
optimal exploration policy, which has an expected reward of 0.6821 per pull.

Each state is encoded as a six-dimensional feature vector [#succ_arm1, #fail_arm1,
#succ_arm2, #fail_arm2, #succ_arm3, #fail_arm3], and I attempted to learn a neural
network approximation to V* with TD(0), TD(1), and ROUT. Again, the parameters for all
algorithms were tuned by hand.
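
The size of this belief-space MDP can be checked with a short calculation. The sketch assumes, as the feature encoding suggests, that a belief state is the tuple of success and failure counts for each arm, with at most n pulls made in total; the function names are hypothetical.

```python
from math import comb


def count_belief_states(k, n):
    """Number of belief states for a k-armed bandit with horizon n.

    A belief state is 2k non-negative counts (successes and failures per arm)
    whose sum, the number of pulls made so far, is at most n.
    """
    return comb(n + 2 * k, 2 * k)


def enumerate_belief_states(k, n):
    """Brute-force count, for checking the closed form on small instances."""
    def tuples_with_sum_at_most(slots, total):
        if slots == 0:
            return 1
        return sum(tuples_with_sum_at_most(slots - 1, total - c)
                   for c in range(total + 1))
    return tuples_with_sum_at_most(2 * k, n)


states_k3_n25 = count_belief_states(3, 25)
```

For k = 3 and n = 25 the closed form gives C(31, 6) = 736,281, matching the state count quoted above.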

The results are shown in Table 2.4. All methods do spectacularly well, although
the TD methods again require more trajectories and more evaluations. Careful
inspection of the problem reveals that a globally linear value function, extrapolated
from the states close to the end, has low Bellman residual and performs very nearly
optimally. Both ROUT and TD successfully exploit this linearity.

2.3.4  Task  4:  Scheduling  a  Factory  Production  Line 

Production  scheduling,  the  problem  of  deciding  how  to  configure  a  factory  sequentially 
to  meet  demands,  is  a  critical  problem  throughout  the  manufacturing  industry.^  We 
assume  we  have  a  modest  number  of  products  (2-100)  and  must  produce  enough 
of  each  to  keep  warehouse  stocks  high  enough  to  satisfy  customer  requests  for  bulk 
shipments.  This  production  model  is  common,  for  example,  for  most  goods  found  in 
a  supermarket. 

An  instance  of  the  production  scheduling  problem  is  composed  of  five  parts: 

Machines  and  products.  This  is  a  list  of  what  machines  are  present  in  the  factory, 
and  what  products  can  be  made  on  the  machines.  There  may  be  complex 
constraints  such  as  “machine  A  can  only  make  product  1  when  machine  B  is 
not  making  product  3.”  A  complete,  legal  assignment  of  products  onto  the  set  of 
machines  is  called  a  configuration.  There  is  also  a  special  “closed”  configuration 
which  represents  a  decision  to  shut  the  factory  down. 

Changeover  times.  It  generally  takes  a  certain  amount  of  time  to  switch  the  factory 
from  one  configuration  to  another.  During  that  time,  there  is  no  production. 
The  problem  definition  includes  a  (possibly  stochastic)  estimate  of  how  long  it 
takes  to  change  each  configuration  to  each  other  configuration. 

Production  rates.  Each  configuration  produces  a  set  of  products  at  a  certain  rate. 
There  may  be  dependencies  between  the  machines.  For  example,  machine  B 
may  produce  product  2  faster  if  machine  A  is  also  producing  product  2.  The 
actual  production  rates  in  the  factory  may  be  very  stochastic;  for  example,  some 
machines  may  jam  frequently,  causing  irregular  delays  on  the  production  line. 

^The  application  of  RL  to  production  scheduling  reported  here  is  the  result  of  a  collaboration 
with  Jeff  Schneider  and  Andrew  Moore  [Schneider  et  al.  98]. 
Figure 2.4. Demand inventory curves (left) and factory layout (right). See text for
further  explanation. 


Inventory  demand  curves.  At  the  time  a  schedule  is  created,  a  demand  curve  for 
each  product  is  available  from  a  corporate  marketing  and  forecasting  group.  As 
shown  in  Fig.  2.4,  each  curve  starts  at  the  left  with  the  current  inventory  of 
that  product.  The  inventory  decreases  over  time  as  future  product  shipments 
are  made  and  eventually  goes  below  zero  if  no  new  production  occurs.  To  avoid 
penalties,  the  scheduler  should  call  for  more  production  before  the  demand  curve 
falls  below  zero.  These  curves  may  also  change  over  time  as  new  information 
about  future  product  demand  becomes  available. 

Schedule  costs.  Running  a  schedule  generates  a  dollar  measure  of  net  profit  or  loss. 
This  includes  the  costs  of  running  the  factory,  paying  the  workers,  purchasing 
the  raw  materials,  and  carrying  inventory  at  the  warehouse.  It  also  includes 
heuristic costs such as an estimate of the damage done by failing to fill a customer
request when the warehouse inventory goes to zero. Finally, it includes
the  revenue  generated  from  selling  product  to  a  customer. 

Given  this  problem  description,  the  task  of  production  scheduling  is  to  maximize 
expected  profit  by  selecting  factory  configurations  over  a  period  of  time.  In  cases 
where the production rates and demand curves are assumed deterministic, the problem
reduces to finding the optimal open-loop schedule: that is, find a fixed sequence
of  configurations  that  maximizes  profit.  In  the  general  stochastic  case,  the  optimal 
choice  of  configuration  at  time  t  will  depend  on  the  outcomes  of  earlier  configurations, 
so  the  optimal  solution  has  the  form  of  a  closed-loop  scheduling  policy. 

The production scheduling problem is modelled very naturally as a Markov Decision
Process, as follows:

• The system state is defined by the current time $t \in \{0, \ldots, T\}$; the current inventory
of each product $p_1, \ldots, p_N$; and, if there are configuration-dependent changeover
times, the current factory configuration.

• The action set consists of all legal factory configurations. We assume a discrete-time
model, so the configuration chosen at time t will run unchanged until time
t + 1.

• The stochastic transition function applies a simulation of the factory to compute
the change in all inventory levels realized by running configuration $C_t$ for 1
timestep. This model handles random variations in production rates straightforwardly;
it also handles changeover times by simply decreasing production in
proportion  to  the  (possibly  stochastic)  downtime.  The  time  t  is  incremented  on 
each  step,  and  the  process  terminates  when  t  =  T. 

•  The  immediate  reward  function  is  computed  from  the  inventory  levels,  based 
on  the  demand  curve  at  time  t.  It  incorporates  the  revenues  from  production, 
penalties  from  late  production,  employee  costs,  operating  costs  and  changeover 
cost incurred during the period. On the final time period (transition from
t = T − 1 to T), a terminal “reward” assigns additional penalties for any
outstanding  unsatisfied  demands. 

The  MDP  model  fully  represents  uncertainty  in  production  rates  and  changeover 
times.  As  defined  here,  the  model  also  handles  noise  in  the  demands  if  that  noise 
is  time-independent,  but  it  cannot  account  for  the  possibility  of  the  demand  curves 
being  randomly  updated  in  the  middle  of  a  schedule,  since  that  would  make  the  MDP 
transition  probabilities  nonstationary.  Finally,  since  the  current  time  t  is  included  as 
part  of  the  state,  the  MDP  is  acyclic:  ROUT  may  be  applied. 

I applied ROUT to a highly simplified version of a real factory’s scheduling problem.
The task involves scheduling 8 weeks of production; however, configurations may
be changed only at 2-week intervals, and only 17 configuration choices are available.
Of these 17, nine have deterministic production rates; the other eight each have two
stochastic outcomes, producing only 1/3 of their usual amount with probability 0.5.
With a total of 9 × 1 + 8 × 2 = 25 outcomes possible from every state, and four
scheduling periods, there are 25^4 = 390,625 possible trajectories through the space.
The optimal policy can be computed by tabulating V*(x) at every possible intermediate
state x of the factory, of which there are 1 + 25 + 25^2 + 25^3 = 16,276. The
optimal policy results in an expected cumulative reward of −$22.8M. By contrast, a
random schedule attains a reward of −$923M. A greedy policy, which at each step
selects a configuration to maximize only the next period’s profit, attains −$97.9M.
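
The outcome and state counts quoted in this paragraph follow from simple arithmetic. The short sketch below is only a sanity check of those numbers; the variable names are illustrative and do not come from the original experiments.

```python
# Sanity check of the outcome, trajectory, and state counts quoted in the text.
deterministic_configs = 9   # configurations with a single outcome
stochastic_configs = 8      # configurations with two possible outcomes
periods = 4                 # 8 weeks, with a configuration choice every 2 weeks

# Outcomes reachable from any one state, summed over all 17 configurations.
outcomes_per_state = deterministic_configs * 1 + stochastic_configs * 2

# Every trajectory is a sequence of `periods` outcomes.
trajectories = outcomes_per_state ** periods

# Non-terminal states: one per outcome history of length 0, 1, ..., periods - 1.
intermediate_states = sum(outcomes_per_state ** k for k in range(periods))

print(outcomes_per_state, trajectories, intermediate_states)
```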

I applied ROUT to this instance, trying two different memory-based function
approximators:  1-nearest  neighbor  and  locally  weighted  linear  regression  [Cleveland 
and  Devlin  88].  (The  local  regression  used  a  kernel  width  of  |  of  the  range  of  each 
input  dimension  in  the  training  data;  this  fraction  was  tuned  manually  over  powers  of 
\.) Since these function approximators are nonparametric, TD(λ) cannot be used to
train them, so ROUT is compared only to the optimal, greedy, and random policies.
ROUT’s exploration and tolerance parameters were also tuned manually.


Algorithm                         Mean Profit ($M), 95% C.I.    Optimal runs
Optimal                           -22.8                         1
Random                            -923.2 ± 58.7                 0
Greedy                            -97.9 ± 15.1                  0
ROUT + nearest neighbor           N/A                           0
ROUT + locally weighted linear    -45.0 ± 16.9                  10/16

Table 2.5. Results on the production scheduling task


Table  2.5  summarizes  the  results.  When  nearest-neighbor  was  used  as  the  function 
approximator,  ROUT  did  not  obtain  sufficient  generalization  from  its  training  set  and 
failed  to  terminate  within  a  limit  of  several  hours.  However,  with  a  locally  weighted 
regression  model,  ROUT  did  run  to  completion  and  produced  an  approximate  value 
function  which  significantly  outperformed  the  greedy  policy.  Moreover,  over  half  of 
these  runs  did  indeed  terminate  with  the  optimal  closed-loop  scheduling  policy.  In 
these cases, ROUT’s final self-built training set for value function approximation
consisted of only about 110 training points, a substantial reduction over the 16,276
required for full tabulation of V*. ROUT’s total running time (≈ 1 hour on a 200 MHz
Pentium Pro) was roughly half of that required to enumerate V* manually.

From these preliminary results, I conclude that ROUT does indeed have the potential
to approximate V* extremely well, given a suitable function approximator
for  the  domain.  However,  since  it  runs  quite  slowly  on  even  this  simple  problem, 
I  believe  ROUT  will  not  scale  up  to  practical  scheduling  instances  without  further 
modification. 

2.4  Discussion 

When a function approximator is capable of fitting V*, ROUT will, in the limit, find
it.  However,  for  ROUT  to  be  efficient,  the  frontier  must  grow  backward  from  the  goal 
quickly,  and  this  depends  on  good  extrapolation  from  the  training  set.  When  good 
extrapolation  does  not  occur,  ROUT  may  become  stuck,  repeatedly  adding  points 
near the goal region and never progressing backwards. Some function approximators
may be especially well-suited to ROUT’s required extrapolation from accurate training
data, and this deserves further exploration. Another promising refinement would
be to adapt the tolerance level ε, thereby guaranteeing progress at the expense of
accuracy.

Grow-Support  and  ROUT  represent  first  steps  toward  a  new  class  of  algorithms 
for  solving  large-scale  MDPs.  Their  primary  innovation  is  that,  without  falling  prey 
to  the  curse  of  dimensionality,  they  are  able  to  explicitly  represent  which  states  are 
already  solved  and  which  are  not  yet  solved.  Using  this  information,  they  “work 
backwards,”  computing  accurate  V*  values  at  targeted  unsolved  states  using  either 
function approximator predictions at solved states or Monte Carlo rollouts. Importantly,
and unlike the Dijkstra and DAG-SP exact algorithms on which they are
based, they grow the solution set for V* back from the goal without requiring an
explicit backward model for the MDP. Only forward simulations are used; this constrains
the distribution of visited states to areas that are actually reachable during
task  execution.  By  treating  solved  and  unsolved  states  differently,  Grow-Support  and 
ROUT  eliminate  the  possibility  of  divergence  caused  by  repeated  value  re-estimation. 

This  chapter  has  reviewed  the  state  of  the  art  in  reinforcement  learning,  a  field 
which  is  grounded  solidly  in  the  theory  of  dynamic  programming  but  provides  few 
guarantees  for  the  practical  cases  where  function  approximators,  rather  than  lookup 
tables,  must  be  used  to  construct  the  value  function.  I  have  described  two  novel 
algorithms  for  approximating  V*  which  are  guaranteed  stable  and  which  perform 
well in practice. In the remaining chapters of this thesis, I consider the simpler VFA
problem of approximating $V^\pi$ for a fixed policy π, on which TD(λ) with linear function
approximators does come with strong convergence guarantees, and I demonstrate a
practical  way  of  exploiting  these  approximations  to  improve  search  performance  on 
global  optimization  tasks. 




Chapter  3 

Learning  Evaluation  Functions  for 
Global  Optimization 

3.1  Introduction 

3.1.1  Global  Optimization 

Global optimization, the problem of finding the best possible configuration from a
large space of possible configurations, is among the most fundamental of computational
tasks. Its numerous applications in science, engineering, and industry include

•  design  and  layout:  optimizing  VLSI  circuit  designs  for  computer  architectures 
[Wong  et  al.  88],  packing  automobile  components  under  a  car  hood  [Szykman 
and  Cagan  95],  architectural  design,  magazine  page  layout 

•  resource  allocation:  airline  scheduling  [Subramanian  et  al.  94],  school  timetabling, 
factory  production  scheduling,  military  logistics  planning 

•  parameter  optimization:  generating  accurate  models  for  weather  prediction, 
ecosystem  modeling  [Duan  et  al.  92],  economic  modeling,  traffic  simulations, 
intelligent  database  querying  [Boyan  et  al.  96] 

• scientific analysis: computational biology (gene sequencing) [Karp 97], computational
chemistry (protein folding) [Neumaier 97]

•  engineering:  medical  robotics  (radiotherapy  for  tumor  treatment)  [Webb  91], 
computer  vision  (line  matching)  [Beveridge  et  al.  96] 

Formally, an instance of global optimization consists of a state space X and an
objective function $\mathrm{Obj} : X \to \mathbb{R}$. The goal is to find a state $x^* \in X$ which minimizes
Obj, that is, $\mathrm{Obj}(x^*) \le \mathrm{Obj}(x)\ \forall x \in X$. If the space X is so small that every state
can be evaluated, then obtaining the exact solution x* is trivial; otherwise, special
knowledge of the problem structure must be exploited. For example, if X is a convex
linearly constrained subset of $\mathbb{R}^n$ and Obj is linear, then x* can be found efficiently
by linear programming.

But for many important optimization problems, efficient solution methods are
unknown. Practical combinatorial optimization problems where X is finite but enormous,
such as the applications listed under “design and layout” and “resource allocation”
above, all too often fall into the class of NP-hard problems, for which efficient
exact algorithms are thought not to exist [Garey and Johnson 79]. Recent progress
in cutting-plane and branch-and-bound algorithms, especially as applied to mixed-integer
linear programming and Travelling Salesperson Problems, has led to exact
solutions for some large NP-hard problem instances [Applegate et al. 95, Subramanian
et al. 94]. However, for most real-world domains, practitioners resort to heuristic
approximation methods which seek good approximate solutions.

To illustrate the discussion that follows, I will present an example optimization
instance from the domain of one-dimensional bin-packing. In bin-packing, we are
given a bin capacity C and a list $L = (a_1, a_2, \ldots, a_n)$ of n items, each having a size
$s(a_i) > 0$. The goal is to pack the items into as few bins as possible, i.e., partition
them into a minimum number m of subsets $B_1, B_2, \ldots, B_m$ such that for each $B_j$,
$\sum_{a_i \in B_j} s(a_i) \le C$. This problem has many real-world applications, including loading
trucks subject to weight limitations, packing commercials into station breaks, and
cutting stock materials from standard lengths of cable or lumber [Coffman et al. 96].
It is also a classical NP-complete optimization problem [Garey and Johnson 79].
Figure 3.1 depicts a small bin-packing instance with thirty items. Packed optimally,
these items fill 9 bins exactly to capacity.


Figure  3.1.  A  small  bin-packing  domain.  Left:  initial  state  (30  items,  each  in  its 
own  bin).  Right:  the  global  optimum  (nine  bins  packed  with  0  waste). 

3.1.2 Local Search

Many  special-purpose  approximation  algorithms  have  been  proposed  and  analyzed 
for  the  bin-packing  problem  [Coffman  et  al.  96].  However,  we  will  focus  instead  on 
a broad class of general-purpose algorithms for the general global optimization problem:
the class of local search algorithms. All of these work by defining a neighborhood
$N(x) \subseteq X$ for each state $x \in X$. Usually, the neighborhood of x consists of simple
perturbations to x: states whose new Obj value can be computed incrementally and
efficiently from Obj(x). However, complex neighborhood structures can also be effective
(e.g., the Lin-Kernighan algorithm for the Traveling Salesman Problem [Lin and
Kernighan  73]).  The  neighborhood  structure  and  objective  function  together  give 
rise  to  a  cost  surface  over  the  state  space.  Local  search  methods  start  at  some  initial 
state,  perhaps  chosen  randomly,  and  then  search  for  a  good  solution  by  iteratively 
moving  over  the  cost  surface  from  state  to  neighboring  state. 

The simplest local search method is known variously as greedy descent, iterative
improvement or, most commonly, hillclimbing.¹ Hillclimbing’s essential property is
that it never accepts a move that worsens Obj. There are several variants. In
steepest-descent hillclimbing, every neighbor $x' \in N(x)$ is evaluated, and the neighbor with
minimal Obj(x') is chosen as the successor state (ties are broken randomly).
Steepest-descent terminates as soon as it encounters a local optimum, a state x for which
$\mathrm{Obj}(x) \le \mathrm{Obj}(x')\ \forall x' \in N(x)$. In another variant, stochastic hillclimbing (also known
as first-improvement), random neighbors of x are evaluated one by one, and the first
one which improves on Obj(x) is chosen. Whether to accept equi-cost moves (moves to
a neighboring state of equal cost) is a parameter of the algorithm. Another parameter,
called patience, governs termination: the search halts after patience neighbors have
been evaluated consecutively and all found unhelpful. Thus, the final state reached
is only probabilistically, not with certainty, a local optimum. Despite this, stochastic
hillclimbing is often preferred for its ability to find a good solution very quickly.
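
Stochastic hillclimbing with a patience parameter can be sketched in a few lines. The code below is a minimal illustration, not the implementation used in this thesis: the function name, its arguments, and the toy objective are all hypothetical. It also makes one assumption the text leaves open, namely that an accepted equi-cost move does not reset the patience counter, which keeps the search from wandering forever on a plateau.

```python
import random


def stochastic_hillclimb(x0, obj, random_neighbor, patience,
                         accept_equal=False, rng=None):
    """First-improvement hillclimbing for minimization.

    Evaluates random neighbors one at a time and halts once `patience`
    consecutive evaluations have failed to strictly improve Obj.
    Returns the trajectory (list of visited states).
    """
    rng = rng or random.Random()
    x, fx = x0, obj(x0)
    trajectory = [x]
    failures = 0
    while failures < patience:
        y = random_neighbor(x, rng)
        fy = obj(y)
        if fy < fx:
            # Strict improvement: accept and reset the patience counter.
            x, fx = y, fy
            trajectory.append(x)
            failures = 0
        elif accept_equal and fy == fx:
            # Assumption: equi-cost moves are taken but still count as unhelpful.
            x = y
            trajectory.append(x)
            failures += 1
        else:
            failures += 1
    return trajectory


# Toy usage: minimize (x - 3)^2 over the integers with neighbors x - 1 and x + 1.
def toy_obj(x):
    return (x - 3) ** 2


def toy_neighbor(x, rng):
    return x + rng.choice((-1, 1))


toy_trajectory = stochastic_hillclimb(10, toy_obj, toy_neighbor,
                                      patience=50, rng=random.Random(0))
print(toy_trajectory)
```

Because termination depends only on a run of unhelpful samples, the final state is a local optimum with high probability rather than with certainty, as noted above.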

Hillclimbing’s  obvious  weakness  is  that  it  gets  stuck  at  the  first  local  optimum 
it  encounters.  Alternative  local  search  methods  provide  a  variety  of  heuristics  for 
accepting  some  moves  to  worse  neighbors.  The  possibilities  include 

•  “Force-best-move”  approaches:  move  to  the  best  neighbor,  even  if  its  value  is 
worse.  (This  approach  has  been  successful  in  the  GSAT  algorithm  [Selman  and 
Kautz  93].) 

¹In this dissertation, I will use the term “hillclimbing” even though we seek a minimum of the
cost  surface.  Simply  imagine  that  the  goal  of  our  metaphorical  mountain  climber  is  to  minimize  his 
distance  from  the  sky. 

• More generally, “biased random walk” approaches: allow moves to worse neighbors
stochastically, perhaps depending on how much worse the neighbor is.
(This approach has been successful in the WALKSAT algorithm [Selman et
al. 96] and others.)

•  Simulated  annealing  [Kirkpatrick  et  al.  83].  This  popular  approach  is  like  the 
biased  random  walk,  but  gradually  lowers  the  probability  of  accepting  a  move 
to a worse neighbor. Search terminates at a local optimum after the probability
falls to zero. Simulated annealing approaches are discussed in detail in
Appendix  B. 

•  Multiple  restarting.  Because  stochastic  hillclimbing  is  so  fast,  a  reasonable 
approach  is  to  apply  it  repeatedly  and  return  the  best  result.  The  restart  can 
be  from  the  start  state,  a  randomly  chosen  state,  or  a  state  selected  according  to 
some  “smarter”  heuristic.  I  review  the  literature  on  smart  restarting  methods 
in  Section  7.1.  Multiple  restarting  is  effective  not  only  for  hillclimbing  but  for 
any  of  the  search  procedures  listed  above. 

In all of these procedures, since the objective function value does not improve monotonically
over time, the search outcome is defined to be the best state ever evaluated
over the entire trajectory.
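
To make the "best state ever evaluated" convention concrete, here is a minimal simulated annealing sketch. It is an illustration under stated assumptions rather than any of the cited algorithms: the acceptance rule is the usual Metropolis criterion, the geometric temperature schedule and its constants are arbitrary, the search simply runs for a fixed number of steps, and all names are hypothetical.

```python
import math
import random


def simulated_annealing(x0, obj, random_neighbor, temperatures, rng=None):
    """Minimize obj, returning the best state ever evaluated and its value."""
    rng = rng or random.Random()
    x, fx = x0, obj(x0)
    best, fbest = x, fx
    for t in temperatures:
        y = random_neighbor(x, rng)
        fy = obj(y)
        # The outcome is the best state evaluated, whether or not it is accepted.
        if fy < fbest:
            best, fbest = y, fy
        # Metropolis rule (an assumption): always accept non-worsening moves;
        # accept a worsening move with probability exp(-(fy - fx) / t).
        if fy <= fx or rng.random() < math.exp((fx - fy) / t):
            x, fx = y, fy
    return best, fbest


def sa_obj(x):
    return (x - 3) ** 2


def sa_neighbor(x, rng):
    return x + rng.choice((-1, 1))


# Hypothetical geometric cooling schedule; every temperature is positive.
schedule = [10.0 * 0.99 ** k for k in range(2000)]
sa_best, sa_fbest = simulated_annealing(10, sa_obj, sa_neighbor, schedule,
                                        rng=random.Random(0))
print(sa_best, sa_fbest)
```

Tracking `best` separately from the current state `x` is what allows the walk to accept worsening moves without losing the best solution seen so far.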

Local search approaches are easy to apply. For the bin-packing domain described
above, a solution state x simply assigns a bin number $b(a_i)$ to each item. Each item
is initially placed alone in a bin: $b(a_1) = 1, b(a_2) = 2, \ldots, b(a_n) = n$. Neighboring
states can be generated by moving any single item $a_i$ into a random other bin
with enough spare capacity to accommodate it. (For details, please see Section 4.2.)
Figure  3.2  illustrates  the  solutions  discovered  by  three  independent  runs  of  stochastic 
hillclimbing  without  equi-cost  moves  on  the  example  instance  of  Figure  3.1.  The  local 
optima  shown  use  11,  14,  and  12  bins,  respectively.  In  400  further  runs,  hillclimbing 
produced  a  10-bin  solution  seven  times  but  never  the  optimal  9-bin  solution. 
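
The state representation and neighborhood just described can be sketched as follows. This is a hedged illustration only: the function names, the item sizes, and the capacity are hypothetical, bins are numbered from 0 rather than 1, and the move operator of Section 4.2 may differ in its details.

```python
import random


def bins_used(assignment):
    """Obj(x): the number of distinct bins holding at least one item."""
    return len(set(assignment))


def bin_loads(assignment, sizes):
    """Total item size currently packed into each non-empty bin."""
    loads = {}
    for item, b in enumerate(assignment):
        loads[b] = loads.get(b, 0) + sizes[item]
    return loads


def random_neighbor(assignment, sizes, capacity, rng):
    """Move one random item into a random other non-empty bin that has room.

    Returns a new assignment, or None if the chosen item fits nowhere else.
    """
    item = rng.randrange(len(assignment))
    loads = bin_loads(assignment, sizes)
    targets = [b for b, load in loads.items()
               if b != assignment[item] and load + sizes[item] <= capacity]
    if not targets:
        return None
    neighbor = list(assignment)
    neighbor[item] = rng.choice(targets)
    return neighbor


sizes = [4, 3, 3, 2, 2, 2]          # hypothetical item sizes
capacity = 8                        # hypothetical bin capacity
start = list(range(len(sizes)))     # each item alone in its own bin
neighbor = random_neighbor(start, sizes, capacity, random.Random(0))
print(bins_used(start), neighbor, bins_used(neighbor))
```

Since items only ever move into bins that are already in use, a move can never increase the number of bins; it lowers Obj by one exactly when the item's old bin becomes empty.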

3.1.3  Using  Additional  State  Features 

Local search has been likened to “trying to find the top of Mount Everest in a thick
fog while suffering from amnesia” [Russell and Norvig 95, p. 111]. Whenever the
climber considers a step, he consults his altimeter, and decides whether to take the
step based on how his altitude has changed. But suppose the climber has access to
not only an altimeter, but also additional senses and instruments: for example, his x
and y location, the slope of the ground underfoot, whether or not he is on a trail, and
the height of the nearest tree. These additional “features” may enable the climber to
make a more informed, more foresightful evaluation of whether to take a step.

In real optimization domains, such additional features of a state are usually plentiful.
For example, here are a few features for the bin-packing domain:

•  the  variance  in  fullness  of  the  bins 

•  the  variance  in  the  number  of  items  in  each  bin 

•  the  fullness  of  the  least-full  bin 

•  the  average  ratio  of  largest  to  smallest  item  in  each  bin 

• the “raw” state $b(a_1), b(a_2), \ldots, b(a_n)$

Features  like  these,  if  combined  and  weighted  correctly,  can  form  a  new  objective 
function  which  indicates  not  just  how  good  a  state  is  itself  as  a  final  solution,  but 
how  good  it  is  at  leading  search  to  other,  better  solutions. 
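
As an illustration, the first four features listed above can be computed directly from a bin assignment. The sketch below is not taken from the thesis: the function and key names are hypothetical, and the use of population variance (rather than sample variance) is an assumption.

```python
from statistics import mean, pvariance


def bin_packing_features(assignment, sizes, capacity):
    """Compute a few state features for a bin-packing assignment.

    assignment[i] is the bin holding item i; sizes[i] > 0 is that item's size.
    """
    bins = {}
    for item, b in enumerate(assignment):
        bins.setdefault(b, []).append(sizes[item])
    fullness = [sum(items) / capacity for items in bins.values()]
    counts = [len(items) for items in bins.values()]
    ratios = [max(items) / min(items) for items in bins.values()]
    return {
        "bins_used": len(bins),                  # the original objective Obj
        "var_fullness": pvariance(fullness),     # variance in fullness of the bins
        "var_item_count": pvariance(counts),     # variance in items per bin
        "min_fullness": min(fullness),           # fullness of the least-full bin
        "mean_size_ratio": mean(ratios),         # avg largest/smallest item ratio
    }


# Two hypothetical items of sizes 2 and 6 with bin capacity 8.
separate = bin_packing_features([0, 1], [2, 6], 8)
together = bin_packing_features([0, 0], [2, 6], 8)
print(separate)
print(together)
```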

This  point  is  illustrated  in  Figure  3.3.  The  figure  plots  the  three  hillclimbing 
trajectories  which  produced  the  three  bin-packing  solutions  shown  above  (Figure  3.2). 
Each visited state $x_i$ is plotted as a point in a 2-D feature space where feature #1 is
simply $\mathrm{Obj}(x_i)$ and feature #2 is the variance in fullness of the bins under packing
$x_i$. In this feature space, the three trajectories all begin at the bottom left, which
corresponds to the initial state shown in Figure 3.1: here, Obj = 30 and the variance
is low, since all the bins are nearly empty. The trajectories proceed rightward, as
each  accepted  move  reduces  by  one  the  number  of  bins  used.  The  variance  first 
increases,  as  some  bins  become  fuller  than  others,  and  then  decreases  near  the  end  of 
the  trajectory  as  all  the  bins  become  rather  full.  The  trajectories  terminate  at  local 
minima  of  quality  11,  14,  and  12,  respectively. 

Figure 3.3. Three stochastic hillclimbing trajectories in bin-packing feature space


The  key  observation  to  make  in  Figure  3.3  is  that  the  variance  feature  can  help 
predict  the  future  search  outcome.  Apparently,  hillclimbing  trajectories  which  pass 
through  higher-variance  states  tend  to  reach  better-quality  solutions  in  the  end.  This 
makes  sense:  higher  variance  states  have  more  nearly  empty  bins,  which  can  be  more 
easily  emptied.  This  is  the  kind  of  knowledge  that  we  would  like  to  integrate  into  an 
improved  evaluation  function  for  local  search. 

It  is  well-known  that  extra  features  can  be  used  to  help  improve  the  searchability  of 
the  cost  surface.  In  simulated  annealing  practice,  engineers  often  spend  considerable 
effort  tweaking  the  coefficients  of  penalty  terms  and  other  additions  to  their  objective 
function.  This  excerpt,  from  a  book  on  VLSI  layout  by  simulated  annealing  [Wong 
et  al.  88],  is  typical: 

Clearly, the objective function to be minimized is the channel width w.
However, w is too crude a measure of the quality of intermediate solutions.
Instead, for any valid partition, the following cost function is used:

$C = w^2 + \lambda_p \cdot p^2 + \lambda_U \cdot U$ (3.1)

where ... $\lambda_p$ and $\lambda_U$ are constants, ...

In this application, the authors hand-tuned the coefficients and set $\lambda_p = 0.5$, $\lambda_U = 10$.
(The next chapter will demonstrate an algorithm that discovered that much
better performance can be achieved by assigning, counterintuitively, a negative value
to $\lambda_U$.) Similar examples of evaluation functions being manually configured and
tuned  for  good  performance  can  be  found  in,  e.g.,  [Cohn  92,  Szykman  and  Cagan 
95,  Falkenauer  and  Delchambre  92]. 

The  question  I  address  is  the  following:  can  extra  features  of  an  optimization 
problem  be  incorporated  automatically  into  improved  evaluation  functions,  thereby 
guiding  search  to  better  solutions? 

3.2  The  “STAGE”  Algorithm 

This section introduces the algorithm which is the main contribution of this dissertation.
STAGE applies the methods of value function approximation to automatically
analyze sample trajectories, like those shown above in Figure 3.3, and to construct
predictive evaluation functions. It then uses these new evaluation functions to guide
further search. STAGE is general, principled, simple, and efficient. In the next chapter,
I will also demonstrate empirically that it is successful at finding high-quality
solutions to large-scale optimization problems.

3.2.1  Learning  to  Predict 

STAGE aims to exploit a simple observation: the performance of a local search algorithm
depends on the state from which the search starts. We can express this
dependence in a mapping from starting states x to expected search result:

$V^\pi(x) \stackrel{\mathrm{def}}{=}$ expected best Obj value seen on a trajectory that starts
from state x and follows local search method π (3.2)

Here, π represents a local search method such as any of the hillclimbing variants
or simulated annealing. Formal conditions under which $V^\pi$ is well-defined will be
given in Section 3.4.1. For now, the intuition is most important: $V^\pi(x)$ evaluates x’s
promise as a starting state for π.

For example, consider minimizing the one-dimensional function $\mathrm{Obj}(x) = (|x| - 10)\cos(2\pi x)$
over the domain X = [−10, 10], as depicted in Figure 3.4. Assuming
a  neighborhood  structure  on  this  domain  where  tiny  moves  to  the  left  or  right  are 
allowed,  hillclimbing  search  clearly  leads  to  a  suboptimal  local  minimum  for  all  but 
the  luckiest  of  starting  points.  However,  the  quality  of  the  local  minimum  reached 
does  correlate  strongly  with  the  starting  position  x,  making  it  possible  to  learn  useful 
predictions. 
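
This example is easy to reproduce numerically. The sketch below is an assumption-laden illustration: it discretizes the domain onto a grid of spacing 0.01 and uses deterministic steepest descent over the two grid neighbors, neither of which is specified in the text, and the function names are hypothetical.

```python
import math

STEP = 0.01            # hypothetical grid spacing for "tiny moves"
LO, HI = -1000, 1000   # grid indices; index i stands for x = i * STEP


def obj(x):
    """Obj(x) = (|x| - 10) cos(2 pi x) on the domain [-10, 10]."""
    return (abs(x) - 10.0) * math.cos(2.0 * math.pi * x)


def hillclimb_outcome(i):
    """Descend from grid index i; return Obj at the local minimum reached."""
    while True:
        candidates = [j for j in (i - 1, i + 1) if LO <= j <= HI]
        best = min(candidates, key=lambda j: obj(j * STEP))
        if obj(best * STEP) >= obj(i * STEP):
            return obj(i * STEP)   # no neighbor improves: local minimum
        i = best


# Outcomes from starting points x = 0.3, 3.3, and 7.3.
outcomes = {i * STEP: hillclimb_outcome(i) for i in (30, 330, 730)}
for x, value in outcomes.items():
    print(f"start x = {x:.1f}  ->  local minimum value {value:.3f}")
```

Starting points closer to x = 0 descend into deeper basins, which is the correlation between starting position and outcome that a learned predictor can exploit.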

Figure 3.4. Left: Obj(x) for a one-dimensional function minimization domain.
Right: the value function $V^\pi(x)$ which predicts hillclimbing’s performance on that
domain.


We seek to approximate $V^\pi$ using a function approximation model such as linear
regression or multi-layer perceptrons, where states x are encoded as real-valued feature
vectors. As discussed above in Section 3.1.3, these input features may encode
any relevant properties of the state, including the original objective function Obj(x)
itself. We denote the mapping from states to features by $F : X \to \mathbb{R}^D$, and our
approximation of $V^\pi(x)$ by $\tilde{V}^\pi(F(x))$.

Training data for supervised learning of $\tilde{V}^\pi$ may be readily obtained by running
π from different starting points. Moreover, if the algorithm π behaves as a Markov
chain (i.e., the probability of moving from state x to x' is the same no matter when
x is visited and what states were visited previously), then intermediate states of
each simulated trajectory may also be considered alternate “starting points” for that
search, and thus used as training data for $\tilde{V}^\pi$ as well. This insight enables us to get
not  one  but  perhaps  hundreds  of  pieces  of  training  data  from  each  trajectory  sampled. 
Under  conditions  which  I  detail  in  Section  3.4.1  below,  all  of  the  local  search  methods 
mentioned  in  Section  3.1.2  have  the  Markov  property. 

Under certain additional conditions, detailed in Section 3.4.1, the function $V^\pi$
can be shown to be precisely the policy value function of a Markov chain. It is then
possible to apply dynamic-programming-based algorithms such as TD(λ), which may
learn more efficiently than supervised learning. I defer a detailed discussion of this
approach until Section 6.1. For the remainder of this and the next two chapters, I
will assume that $V^\pi$ is approximated by supervised learning as outlined above.

The state space X is huge, so we cannot expect our simulations to explore any
significant fraction of it. Instead, we must depend on good extrapolation from the
function approximator if we are to learn $\tilde{V}^\pi$ accurately. Specifically, we hope that
the function approximator will predict good results for unexplored states which share
many features with training states that performed well. If $V^\pi$ is fairly smooth, this
hope is reasonable.

3.2.2  Using  the  Predictions 

The learned evaluation function $\tilde{V}^\pi(F(x))$ evaluates how promising x is as a starting
point for algorithm π. To find the best starting point, we must optimize $\tilde{V}^\pi$ over X.
We do this by simply applying stochastic hillclimbing with $\tilde{V}^\pi$ instead of Obj as the
evaluation function.²

The “STAGE” algorithm provides a framework for learning and exploiting $\tilde{V}^\pi$
on a single optimization instance. As illustrated in Figure 3.5, STAGE repeatedly
alternates between two different stages of local search: running the original method
π on Obj, and running hillclimbing on $\tilde{V}^\pi$ to find a promising new starting state for
π. Thus, STAGE can be viewed as a smart multi-restart approach to local search.


Figure  3.5.  A  diagram  of  the  main  loop  of  STAGE 


A  compact  specification  of  the  algorithm  is  given  in  Table  3.2.2  (p.  52).  In  the 
remainder  of  this  section,  I  give  a  verbose  description  of  the  algorithm. 

STAGE’s inputs are X, S, π, Obj, ObjBound, F, Fit, N, Pat, and TotEvals:

•  the  state  space  X 

•  starting  states  S  C  X  (and  a  method  for  generating  a  random  state  in  S) 

• π, the local search method from which STAGE learns. π is assumed to be
Markovian and proper, conditions which I discuss in detail in Section 3.4.1.
Intuitively, it is easiest to think of the prototypical case, hillclimbing. Note
that π encapsulates the full specification of the method including all its internal
parameters; for example, π = stochastic hillclimbing, rejecting equi-cost moves,
with patience 200.

²Note that even if $\tilde{V}^\pi$ is smooth with respect to the feature space (as it surely will be if we
represent $\tilde{V}^\pi$ with a simple model like linear regression), it may still give rise to a complex cost
surface with respect to the neighborhood structure on X.

• the objective function, $\mathrm{Obj} : X \to \mathbb{R}$, to be minimized

• a lower bound on Obj, $\mathrm{ObjBound} \in \mathbb{R}$ (or $-\infty$ if no bound is known). Its use
is described in step 2d below.

• a featurizer F mapping states to real-valued features, $F : X \to \mathbb{R}^D$

• Fit, a function approximator. Given a set of training pairs of the form
$\{(f_{i1}, f_{i2}, \ldots, f_{iD}) \mapsto y_i\}$, Fit produces a real-valued function over $\mathbb{R}^D$ which
approximately fits the data.

• a neighborhood structure $N : X \to 2^X$ and patience parameter Pat for running
stochastic hillclimbing on $\tilde{V}^\pi$

•  TotEvals,  the  number  of  state  evaluations  allotted  for  this  run. 

STAGE's output is a single state $x$, which has the lowest Obj value of any state
evaluated during the run. It also outputs the final learned evaluation function $\tilde{V}^\pi$,
which provides interesting insights about what combination of features led to good
performance. (If those insights apply generally to more than one problem instance,
then the learned $\tilde{V}^\pi$ may be profitably transferred, as I show in Section 6.2.)

STAGE begins by initializing $x_0$ to a random start state. The main loop of STAGE
proceeds as follows:

Step 2a: Optimize Obj using $\pi$. From $x_0$, run search algorithm $\pi$, producing a
search trajectory $(x_0, x_1, x_2, \ldots, x_T)$.

Note that we have assumed $\pi$ is a proper search procedure: the trajectory is
guaranteed to terminate.

Step 2b: Train $\tilde{V}^\pi$. For each point $x_i$ on the search trajectory, define
$y_i := \min_{j=i \ldots T} \mathrm{Obj}(x_j)$, and add the pair $\{F(x_i) \mapsto y_i\}$ to the training set for
Fit. Retrain Fit; call the resulting learned evaluation function $\tilde{V}^\pi$.

We accumulate a training set of all states ever visited by $\pi$. The target values
$y_i$ correspond to our definition of $V^\pi$ in Equation 3.2 (p. 47). Note that for
hillclimbing-like policies $\pi$ which monotonically improve Obj, $y_i = \mathrm{Obj}(x_T)\ \forall i$.

Section 3.4.3 below shows how to make this step time- and space-efficient.

Step 2c: Optimize $\tilde{V}^\pi$ using hillclimbing. Continuing from $x_T$, optimize $\tilde{V}^\pi(F(x))$
by performing a stochastic hillclimbing search over the neighborhood structure
$N$. Cut off the search when either Pat consecutive moves produce no improvement,
or a candidate state $z_{t+1}$ is predicted to be impossibly good, i.e.
$\tilde{V}^\pi(F(z_{t+1})) < \mathrm{ObjBound}$. Denote this search trajectory by $(z_0, z_1, \ldots, z_t)$.

Note that this stochastic hillclimbing trajectory begins from where the previous
$\pi$ trajectory ended. This decision is evaluated empirically in Section 5.2.4. $\tilde{V}^\pi$
leads search to a state which is predicted to be a good new start state for $\pi$.
We cut off the search if it reaches a state which promises a solution better than
a known lower bound for the problem. For example, in bin-packing, we cut off
if $\tilde{V}^\pi$ predicts that $\pi$ will lead to a solution which uses a negative number of
bins! This refinement can help prevent $\tilde{V}^\pi$ from leading search too far astray if
the function approximation is very inaccurate. I show the empirical benefits of
this refinement in Section 5.2.5.

Step 2d: Set smart restart state. Set $x_0 := z_t$. But in the event that the
$\tilde{V}^\pi$ hillclimbing search accepted no moves (i.e., $z_t = x_T$), then reset $x_0$ to a new
random starting state.

This reset operation is occasionally necessary to "un-stick" the search from a
state which is a local optimum of both Obj and $\tilde{V}^\pi$. For example, on STAGE's
first iteration, Fit has been trained on only one outcome of $\pi$, so $\tilde{V}^\pi$ will be
constant, presenting no hill to climb. Lacking any information about which
search directions lead to smart restart states, we restart randomly. This provision
ensures that STAGE reverts to random multi-restart $\pi$ in certain degenerate
cases, e.g. if every state in $X$ has identical features.

STAGE  terminates  as  soon  as  TotEvals  states  have  been  evaluated.  This  count 
includes  both  accepted  and  rejected  states  considered  during  both  the  Step  2a  search 
and  the  Step  2c  search. 
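
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the names `hillclimb`, `fit_line`, and `stage` are hypothetical, $\pi$ is taken to be stochastic hillclimbing rejecting equi-cost moves, the fitter is a one-feature least-squares line, and only the end states of Step 2a are compared when tracking the best state.

```python
import random


def hillclimb(x0, score, neighbors, patience, rng, budget, stop_below=float("-inf")):
    """Stochastic hillclimbing that rejects equi-cost moves.

    Returns the accepted states. `budget` is a one-element list so that every
    evaluated candidate is charged against the shared TotEvals allowance.
    """
    traj, current, fails = [x0], score(x0), 0
    while fails < patience and budget[0] > 0:
        cand = rng.choice(neighbors(traj[-1]))
        budget[0] -= 1
        value = score(cand)
        if value < stop_below:  # predicted impossibly good: cut off (Step 2c)
            break
        if value < current:
            traj.append(cand)
            current, fails = value, 0
        else:
            fails += 1
    return traj


def fit_line(pairs):
    """Hypothetical fitter: least-squares line on the first feature only."""
    n = len(pairs)
    mx = sum(f[0] for f, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((f[0] - mx) ** 2 for f, _ in pairs)
    slope = 0.0 if sxx == 0 else sum((f[0] - mx) * (y - my) for f, y in pairs) / sxx
    return lambda f: my + slope * (f[0] - mx)


def stage(obj, random_start, neighbors, featurize, fit, patience, tot_evals,
          obj_bound=float("-inf"), seed=0):
    rng = random.Random(seed)
    budget, training = [tot_evals], []
    x0 = random_start(rng)
    best = x0
    while budget[0] > 0:
        # Step 2a: run pi (here, hillclimbing) on Obj.
        traj = hillclimb(x0, obj, neighbors, patience, rng, budget)
        best = min([best, traj[-1]], key=obj)
        # Step 2b: label each visited state with the best Obj seen from it onward.
        y = float("inf")
        for x in reversed(traj):
            y = min(y, obj(x))
            training.append((featurize(x), y))
        v = fit(training)
        # Step 2c: hillclimb on the learned function, starting where pi ended.
        ztraj = hillclimb(traj[-1], lambda s: v(featurize(s)), neighbors,
                          patience, rng, budget, stop_below=obj_bound)
        # Step 2d: smart restart, or a random restart if no move was accepted.
        x0 = ztraj[-1] if len(ztraj) > 1 else random_start(rng)
    return best
```

The shared `budget` counter mirrors the termination rule: candidates evaluated in both Step 2a and Step 2c count toward TotEvals, whether or not they are accepted.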

3.3  Illustrative  Examples 

We  now  illustrate  STAGE’S  performance  on  the  two  sample  domains  described  earlier 
in  this  chapter,  the  one-dimensional  wave  function  and  the  small  bin-packing  problem. 

3.3.1  1-D  Wave  Function 

For the wave function example of Figure 3.4 (p. 48), the baseline search from which
STAGE learns is hillclimbing with neighborhood moves of $\pm 0.1$. We encode the
state using a single input feature, $x$ itself, and we model $\tilde{V}^\pi$ by quadratic
regression. Thus, STAGE will be building parabola-shaped approximations to the
staircase-shaped true $V^\pi$. We assume no prior knowledge of a bound on the objective
function, i.e., ObjBound $= -\infty$.

STAGE($X$, $S$, $\pi$, Obj, ObjBound, $F$, Fit, $N$, Pat, TotEvals):

Given:

• a state space $X$

• starting states $S \subset X$ (and a method for generating a random state in $S$)

• a local search procedure $\pi$ that is Markovian and proper (e.g., hillclimbing)

• an objective function, Obj : $X \to \mathbb{R}$, to be minimized

• a lower bound on Obj, ObjBound $\in \mathbb{R}$ (or $-\infty$ if no bound is known)

• a featurizer $F$ mapping states to real-valued features, $F : X \to \mathbb{R}^D$

• a function approximator Fit

• a neighborhood structure $N : X \to 2^X$ and patience parameter Pat for running
stochastic hillclimbing on $\tilde{V}^\pi$

• TotEvals, the number of state evaluations allotted for this run

1. Initialize the function approximator; let $x_0 \in S$ be a random starting state for
search.

2. Loop until number of states evaluated exceeds TotEvals:

(a) Optimize Obj using $\pi$. From $x_0$, run search algorithm $\pi$, producing a
search trajectory $(x_0, x_1, x_2, \ldots, x_T)$.

(b) Train $\tilde{V}^\pi$. For each point $x_i$ on the search trajectory, define
$y_i := \min_{j=i \ldots T} \mathrm{Obj}(x_j)$, and add the pair $\{F(x_i) \mapsto y_i\}$ to the training set for
Fit. Retrain Fit; call the resulting learned evaluation function $\tilde{V}^\pi$.

(c) Optimize $\tilde{V}^\pi$ using hillclimbing. Continuing from $x_T$, optimize
$\tilde{V}^\pi(F(x))$ by performing a stochastic hillclimbing search over the neighborhood
structure $N$. Cut off the search when either Pat consecutive
moves produce no improvement, or a candidate state $z_{t+1}$ is predicted to
be impossibly good, i.e. $\tilde{V}^\pi(F(z_{t+1})) < \mathrm{ObjBound}$. Denote this search
trajectory by $(z_0, z_1, \ldots, z_t)$.

(d) Set smart restart state. Set $x_0 := z_t$. But in the event that the $\tilde{V}^\pi$
hillclimbing search accepted no moves (i.e., $z_t = x_T$), then reset $x_0$ to a
new random starting state.

3. Return the best state found.


Table 3.1. The STAGE algorithm


A sample run is depicted in Figure 3.6. The first iteration begins at a random
starting state of $x_0 = 9.3$ and greedily descends to the local minimum at $(9, -1)$. In
Step 2b, our function approximator trains on the trajectory's feature/outcome pairs:
$\{\{9.3 \mapsto -1\}, \{9.2 \mapsto -1\}, \{9.1 \mapsto -1\}, \{9.0 \mapsto -1\}\}$ (shown in the diagram as small
diamonds). The resulting least-squares quadratic approximation is, of course, the
line $\tilde{V}^\pi = -1$. In Step 2c, hillclimbing on this flat function accepts no moves, so in
Step 2d, we reset to a new random state (in this example run, $x_0 = 7.8$).


[Plot panels not reproduced; horizontal axes show the state x over the range -10 to 10.]

Figure 3.6. STAGE working on the 1-D wave example

On the second iteration (top right), greedy descent leads to the local minimum at
$(8, -2)$, producing three new training points for $\tilde{V}^\pi$ (shown as small '+' symbols). The
new best-fit parabola predicts that fantastically promising starting states can be found
to the left. Hillclimbing on this parabola in Step 2c, we move all the way to the left
edge of the domain, $x_0 = -10$, from which $\tilde{V}^\pi$ predicts that hillclimbing will produce
an outcome of $-212$! (Note that with a known ObjBound, STAGE would have
recognized $\tilde{V}^\pi$'s overoptimistic extrapolation and cut off hillclimbing before reaching
the left edge. But no harm is done.)
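
The first two fits can be reproduced approximately with a small least-squares routine. The sketch below is illustrative only: `fit_quadratic` is a hypothetical helper that solves the normal equations directly, and the three second-iteration points (7.8, 7.9, 8.0, each with target $-2$) are an assumption read off from the step size of 0.1, so the extrapolated value at $x = -10$ need not match the $-212$ quoted in the text.

```python
def fit_quadratic(points):
    """Least-squares fit of y = c0 + c1*u + c2*u^2, with u = x - mean(x).

    Centering the inputs keeps the 3x3 normal equations well conditioned.
    """
    mean = sum(x for x, _ in points) / len(points)
    data = [(x - mean, y) for x, y in points]
    # Normal equations, stored as an augmented 3x4 matrix.
    rows = [[sum(u ** (i + j) for u, _ in data) for j in range(3)]
            + [sum(y * u ** i for u, y in data)] for i in range(3)]
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        rows[col], rows[piv] = rows[piv], rows[col]
        for r in range(3):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    c = [rows[i][3] / rows[i][i] for i in range(3)]
    return lambda x: c[0] + c[1] * (x - mean) + c[2] * (x - mean) ** 2


# Iteration 1: every state on the trajectory is labelled with the outcome -1.
first = [(9.3, -1), (9.2, -1), (9.1, -1), (9.0, -1)]
# Iteration 2 (assumed points): three more states, labelled with the outcome -2.
second = first + [(7.8, -2), (7.9, -2), (8.0, -2)]

v1 = fit_quadratic(first)    # flat: nothing to climb
v2 = fit_quadratic(second)   # slopes steeply downward to the left

print(v1(9.0), v1(-10.0))
print(v2(8.0), v2(-10.0))
```

With constant targets the fit is flat, which is why Step 2c accepts no moves on the first iteration; with two clusters of targets the parabola extrapolates far below any observed value at the left edge of the domain.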

The third iteration (bottom left) quickly punishes $\tilde{V}^\pi$'s overenthusiasm. Greedy
descent takes one step, from $(-10, 0)$ to $(-9.9, -0.08)$. The two new training points,
$(-10 \mapsto -0.08)$ and $(-9.9 \mapsto -0.08)$ (shown as squares), give rise to a nice concave
parabola which correctly predicts that the best starting points for $\pi$ are near the
center of the domain. Hillclimbing on $\tilde{V}^\pi$ produces a smart restart point of $x_0 = 0.1$.
From there, on iteration 4, $\pi$ easily reaches the global optimum at $(0, -10)$. After
ten iterations (bottom right), much more training data has been gathered near the
center of the domain, and $\tilde{V}^\pi$ leads $\pi$ to the global optimum every time.

This  problem  is  contrived,  but  its  essential  property— that  features  of  the  state 
help  to  predict  the  performance  of  an  optimizer — does  indeed  hold  in  many  practical 
domains.  Simulated  annealing  does  not  take  advantage  of  this  property,  and  indeed 
performs  poorly  on  this  problem.  This  problem  also  illustrates  that  STAGE  does 
more  than  simply  smoothing  out  the  wiggles  in  Obj  (x) — doing  so  here  would  produce 
an  unhelpful  flat  function.  STAGE  does  smooth  out  the  wiggles,  but  in  a  way  that 
incorporates  predictive  knowledge  about  local  search. 

3.3.2  Bin-packing 

Our  second  illustrative  domain,  bin-packing,  is  more  typical  of  the  kind  of  practical 
combinatorial  optimization  problem  we  want  STAGE  to  attack.  We  return  to  the 
example  instance  of  Figure  3.1  (p.  42).  The  baseline  search  from  which  STAGE  will 
learn  is  stochastic  hillclimbing  over  the  search  neighborhood  described  on  page  43. 
The  starting  state  is  the  packing  which  places  each  item  in  its  own  separate  bin. 

Recall that in Figure 3.3 (p. 46), we observed that a simple state feature, variance
in bin fullness levels, correlated with the quality of solution hillclimbing would
eventually reach. To apply STAGE, we will use this feature and the true objective
function to encode each state. We again model $\tilde{V}^\pi$ by quadratic regression over these
two features. We assume no prior knowledge of bounds on the objective function,
i.e., ObjBound $= -\infty$.
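
As a rough illustration of this two-feature encoding, the sketch below computes the pair (Obj, Var) for a packing. The representation of a packing as a list of bins, the function name `bin_packing_features`, and the use of the population variance are all assumptions made for the example; the text does not specify them.

```python
from statistics import pvariance


def bin_packing_features(bins, capacity):
    """Return (number of bins used, variance of bin fullness levels).

    `bins` is a list of lists of item sizes; empty bins are ignored.
    Assumes at least one bin is non-empty.
    """
    used = [b for b in bins if b]
    fullness = [sum(b) / capacity for b in used]
    return len(used), pvariance(fullness)


# Two packings of the same three items into bins of capacity 1.0.
print(bin_packing_features([[0.5], [0.3], [0.2]], 1.0))  # one item per bin
print(bin_packing_features([[0.5, 0.3, 0.2]], 1.0))      # everything in one bin
```

A packing in which some bins are nearly full and others nearly empty has a higher variance than one in which all bins are equally full, and this is the signal the text reports as correlating with the eventual hillclimbing outcome.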

Snapshots from iterations 1, 2, 3 and 7 of a STAGE run are depicted in Figure 3.7.
On the first iteration (top left plot), STAGE hillclimbs from the initial state (Obj($x$) =
30, Var($x$) = 0.011) to a local optimum (Obj($x$) = 13, Var($x$) = 0.019). Training each
state of that trajectory to predict the outcome 13 results in a flat $\tilde{V}^\pi$ function (top
right). Hillclimbing on this flat $\tilde{V}^\pi$ accepts no moves, so in Step 2d STAGE resets to
the initial state.

On the second iteration of STAGE (second row of Figure 3.7), the new stochastic
hillclimbing trajectory happens to do better than the first, finishing at a local optimum
(Obj($x$) = 11, Var($x$) = 0.022). Our training set is augmented with target values of 11
for all states on the new trajectory. The resulting quadratic $\tilde{V}^\pi$ already has significant
structure. Note how the contour lines of $\tilde{V}^\pi$, shown on the base of the surface plot,
correspond to smoothed versions of the trajectories in our training set. Extrapolating,
$\tilde{V}^\pi$ predicts that the best starting points for $\pi$ are on arcs with higher Var($x$).

STAGE hillclimbs on the learned $\tilde{V}^\pi$ to try to find a good starting point. The
trajectory, shown as a dashed line in the third plot, goes from (Obj($x$) = 11, Var($x$) =
0.022) up to (Obj($x$) = 12, Var($x$) = 0.105). Note that the search was willing to
accept some harm to the true objective function during this stage. From the new
starting state, hillclimbing on Obj does indeed lead to a yet better local optimum at
(Obj($x$) = 10, Var($x$) = 0.053).

During further iterations, the approximation of $\tilde{V}^\pi$ is further refined. Continuing
to alternate between standard hillclimbing on Obj (solid trajectories) and hillclimbing
on $\tilde{V}^\pi$, STAGE manages to discover the global optimum at (Obj($x$) = 9, Var($x$) =
0) on iteration seven. STAGE's complete trajectory is plotted at the bottom left
of Figure 3.7. Contrast this STAGE trajectory with the multi-restart stochastic
hillclimbing trajectory shown in Figure 3.8, which never reached any solution better
than Obj($x$) = 11.

This  example  illustrates  STAGE’S  potential  to  exploit  high-level  state  features 
to  improve  performance  on  combinatorial  optimization  problems.  It  also  illustrates 
the  benefit  of  training  on  entire  trajectories,  not  just  starting  states:  in  this  run 
a  useful  quadratic  approximation  was  learned  after  only  two  iterations.  Extensive 
results  on  larger  bin-packing  instances,  and  on  many  other  large-scale  domains,  are 
presented  in  Chapter  4. 


3.4  Theoretical  and  Computational  Issues 

STAGE is a general algorithm: it learns to predict the outcome of a local search
method $\pi$ with a function approximator Fit over a feature space $F$. There are many
choices for $\pi$, Fit, and $F$. This section describes how to make those choices so that
STAGE is well-defined and efficient.

Figure 3.8. Trajectory of multi-restart hillclimbing in bin-packing feature space


3.4.1 Choosing $\pi$

STAGE learns from a baseline local search procedure $\pi$. Here, I identify precise
conditions on $\pi$ that make it suitable for STAGE. First, some definitions.

A run of a local search procedure $\pi$, starting from state $x_0 \in X$, stochastically
produces a sequence of states called a trajectory, denoted $\tau = (x_0^\tau, x_1^\tau, x_2^\tau, \ldots)$. In
general, the trajectory may be infinite, or it may terminate after a finite number
of steps $|\tau|$. For any terminating trajectory, we define $x_i^\tau$ to equal a special state,
denoted END, for all $i > |\tau|$.

Let $T$ denote the set of all possible trajectories over $X$. Formally, $\pi$ is characterized
by the probability with which it generates each trajectory:

Definition 1. A local search procedure is a stochastic function $\pi : X \to T$. Given a
starting state $x \in X$, let $P(\tau \mid x_0^\tau = x)$ denote the probability distribution with which
$\pi$ produces $\tau$.

I  now  define  three  key  properties  that  tt  may  have: 

Definition 2. A local search procedure $\pi$ is said to be proper if, from any starting
state $x \in X$, the END state is eventually reached with probability 1.

Definition 3. A local search procedure $\pi$ is said to be Markovian if, no matter when
a state $x_i$ is visited, a step to another state occurs with the same fixed probability,
denoted $p(x_{i+1} \mid x_i)$. In symbols: for all $i \in \mathbb{N}$ and $x_0, x_1, \ldots, x_i, x_{i+1} \in X \cup \{\text{END}\}$,

$$P(x_{i+1}^\tau = x_{i+1} \mid x_0^\tau = x_0, x_1^\tau = x_1, \ldots, x_i^\tau = x_i) = p(x_{i+1} \mid x_i)$$


Definition 4. A local search procedure $\pi$ is said to be monotonic with respect to
Obj if, for every trajectory $\tau \in T$ that $\pi$ can generate,

$$\mathrm{Obj}(x_0^\tau) \ge \mathrm{Obj}(x_1^\tau) \ge \mathrm{Obj}(x_2^\tau) \ge \cdots$$

If all the inequalities are strict (until perhaps reaching an END state) for every
trajectory, then $\pi$ is called strictly monotonic.

Note  that  the  conditions  of  being  proper,  Markovian,  or  monotonic  are  independent: 
a  local  search  procedure  may  satisfy  none,  any  one,  any  two,  or  all  three  definitions. 

What conditions on $\pi$ make it suitable for learning by STAGE? The essence of
STAGE is to approximate the function $V^\pi(x)$, which predicts the expected best Obj
value on a trajectory that starts from state $x$ and follows procedure $\pi$. Formally, we
define $V^\pi$ as follows.

Definition 5. For any local search procedure $\pi$ and objective function Obj : $X \to \mathbb{R}$,
the function $V^\pi : X \to \mathbb{R}$ is defined by

$$V^\pi(x) \stackrel{\text{def}}{=} E\left[\inf\{\mathrm{Obj}(x_k^\tau) \mid k = 0, 1, 2, \ldots\}\right],$$

where the expectation is taken over the trajectory distribution $P(\tau \mid x_0^\tau = x)$.

Under one very reasonable assumption, we can show that $V^\pi$ is well-defined for
any policy $\pi$, i.e., that the expectation of Definition 5 exists:

Proposition 1. If Obj is bounded below, then $V^\pi(x)$ is well-defined at every state
$x \in X$ and for all policies $\pi$.

Proof. Writing out the expectation of Definition 5, we have

$$V^\pi(x) = \sum_{\tau \in T} P(\tau \mid x_0^\tau = x)\, \inf\{\mathrm{Obj}(x_k^\tau) \mid k = 0, 1, 2, \ldots\}.$$

Each trajectory's infimum is bounded below by the assumed global bound on Obj,
and bounded above by the value of the starting state, $\mathrm{Obj}(x_0^\tau) = \mathrm{Obj}(x)$. Thus,
$V^\pi(x)$ is a convex sum of bounded quantities, which is a well-defined quantity. □

$V^\pi$ is well-defined for any policy $\pi$, even improper ones. However, STAGE learns
by collecting multiple sample trajectories of $\pi$; and in order for that to make sense, the
trajectories must terminate. Hence, STAGE requires that $\pi$ be proper (Definition 2).
Later in this section, I will discuss several ways of turning improper policies into
proper ones for use with STAGE.

STAGE also requires $\pi$ to be Markovian. The reason for this condition is that when
$\pi$ is Markovian, STAGE can use the sample trajectory data it collects much more
efficiently. The key insight is that in any trajectory $\tau = (x_0, x_1, \ldots, x_i, x_{i+1}, \ldots)$, each
tail of the trajectory $(x_i, x_{i+1}, \ldots)$ is generated with exactly the same probability as
if $\pi$ had started a new trajectory from $x_i$. In other words, when $\pi$ is Markovian,
every state visited is effectively a new starting state. This means that in Step 2b of
the STAGE algorithm (refer to page 52), STAGE can use every state of every sample
trajectory as training data for approximating $V^\pi$.
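
Because every visited state counts as a starting state, the Step 2b targets for a whole trajectory can be computed in one backward pass. The following sketch assumes the trajectory is available as a list of Obj values; the function name `suffix_min_targets` is hypothetical, and the sample values are borrowed from the best-so-far example later in this section.

```python
def suffix_min_targets(obj_values):
    """Return y_i = min(obj_values[i:]) for every i, in a single backward pass."""
    targets = [None] * len(obj_values)
    best = float("inf")
    for i in range(len(obj_values) - 1, -1, -1):
        best = min(best, obj_values[i])
        targets[i] = best
    return targets


# A non-monotonic trajectory: each state is labelled with the best value
# reached from that state onward.
print(suffix_min_targets([42, 44, 39, 31, 35, 31, 40, 28, 30]))
# -> [28, 28, 28, 28, 28, 28, 28, 28, 30]

# A monotonic trajectory: every state gets the final value.
print(suffix_min_targets([30, 25, 13]))
# -> [13, 13, 13]
```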

Exploiting  the  Markov  property  can  increase  the  amount  of  available  training  data 
by  several  orders  of  magnitude.  In  practice,  though,  the  extra  training  points  collected 
this  way  may  be  highly  correlated,  so  it  is  unclear  how  much  they  will  improve 
optimization  performance.  Section  5.2.3  shows  empirically  that  the  improvement  can 
be  substantial. 

So far, we have required that $\pi$ be proper for reasons of algorithmic validity and
that $\pi$ be Markovian for reasons of data efficiency. These are the only conditions
imposed by the basic STAGE algorithm of page 52. However, we can impose the
additional condition that $\pi$ be monotonic, for reasons of memory efficiency. When
$\pi$ is monotonic, Markovian and proper, the infimum in the definition of $V^\pi$ can be
rewritten as an infinite sum:

$$
\begin{aligned}
V^\pi(x) &= E\left[\inf\{\mathrm{Obj}(x_k^\tau) \mid k = 0, 1, 2, \ldots\}\right] \\
&= E\left[\lim_{k \to \infty} \mathrm{Obj}(x_k^\tau)\right] && \text{(since $\pi$ is monotonic)} \\
&= E\left[\sum_{k=0}^{\infty} R(x_k^\tau, x_{k+1}^\tau)\right] && \text{(since $\pi$ is proper)}
\end{aligned}
\tag{3.3}
$$

where all expectations are taken over the trajectory distribution $P(\tau \mid x_0^\tau = x)$, and
the additive cost function $R$ is defined by

$$
R(x, x') = \begin{cases} \mathrm{Obj}(x) & \text{if } x \neq \text{END and } x' = \text{END}, \\ 0 & \text{otherwise.} \end{cases}
$$
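
A small numerical check can make the rewriting concrete. The sketch below assumes that $R$ pays $\mathrm{Obj}(x)$ on the single transition into END and zero elsewhere, which is the reading under which the sum equals the final (and, for a monotonic trajectory, best) objective value. The names `END`, `additive_cost`, and `summed_cost` are hypothetical.

```python
END = "END"  # sentinel standing in for the special END state


def additive_cost(x, x_next, obj):
    """R(x, x'): pay Obj(x) only on the transition into END, otherwise 0."""
    return obj(x) if x is not END and x_next is END else 0.0


def summed_cost(trajectory, obj):
    """Sum R over consecutive pairs of a terminated trajectory (last entry END)."""
    return sum(additive_cost(a, b, obj) for a, b in zip(trajectory, trajectory[1:]))


# A monotonic trajectory with Obj values 42 > 39 > 31 > 28, then termination.
values = {"a": 42, "b": 39, "c": 31, "d": 28}
trajectory = ["a", "b", "c", "d", END]

total = summed_cost(trajectory, values.get)
print(total, min(values.values()))  # both equal 28
```

For a trajectory that is not monotonic the two quantities can differ, which is why monotonicity is needed before the additive form can be used.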

Writing $V^\pi$ in this form reveals that it is precisely the policy value function of
the Markov chain $(X, \pi, R)$, as defined earlier in Section 2.1.1. This means that $V^\pi$
satisfies the Bellman equations for prediction (Eq. 2.3), and all the algorithms of
reinforcement learning apply. In particular, using the method of Least-Squares TD($\lambda$),
STAGE can learn a linear approximation to $V^\pi$ without ever storing a trajectory in
memory, thereby reducing memory usage significantly, and with no additional
computational expense over supervised linear regression. The details of this technique
are given in Section 6.1.

We have shown that in order for STAGE to learn from a local search procedure $\pi$,
it is desirable for $\pi$ to be proper, Markovian and monotonic. The remainder of this
section investigates to what extent these conditions hold, or can be made to hold, for
commonly used local search algorithms.

Steepest-descent hillclimbing. Steepest-descent takes a step from $x$ to a neighboring
state $x' \in N(x)$ which maximally improves over Obj($x$). If no neighbors
improve Obj, the trajectory terminates.

Steepest-descent is strictly monotonic. A strictly monotonic policy never visits
the same state twice on any trajectory, so if $X$ is finite (as is the case with
combinatorial optimization problems), steepest-descent is guaranteed to terminate.
Steepest-descent is also clearly Markovian: at a local optimum it terminates
deterministically, and from any other state it steps with equal probability to
any $x' \in N(x)$ for which $\mathrm{Obj}(x') = \min_{z \in N(x)} \mathrm{Obj}(z)$.

Thus, steepest-descent procedures are strictly monotonic, Markovian, and (if $X$
is finite) proper.

Stochastic hillclimbing. For search problems where the neighborhoods $N(x)$ are
large, stochastic hillclimbing is cheaper to run than steepest-descent. We consider
first the case where equi-cost moves are rejected, that is, a move from $x$
to $x'$ is accepted only if $x'$ belongs to the set $G(x) = \{g \in N(x) : \mathrm{Obj}(g) < \mathrm{Obj}(x)\}$.
Let HC represent this procedure for a given state space $X$, neighborhood
function $N$, objective function Obj, and patience value Pat.

HC is strictly monotonic. As above, assuming $X$ is finite, HC is guaranteed to
terminate. HC is also Markovian, with the following transition probabilities
(HC stops at $x$ exactly when Pat consecutive sampled neighbors all fall outside $G(x)$):

$$p(\text{END} \mid x) = \left(1 - \frac{|G(x)|}{|N(x)|}\right)^{\text{Pat}}$$

$$
p(x' \mid x) = \begin{cases} \dfrac{1}{|G(x)|}\bigl(1 - p(\text{END} \mid x)\bigr) & \text{if } x' \in G(x) \\ 0 & \text{if } x' \notin G(x) \cup \{\text{END}\}. \end{cases}
$$

These transition probabilities assume that all neighbors are equally likely to
be sampled; it is straightforward to reweight them in the case of non-uniform
sampling distributions.

However, if we drop the assumption of rejecting equi-cost moves, stochastic
hillclimbing with patience-based termination is no longer Markovian. The reason
is that after an equi-cost move, the patience counter is not reset to zero, so
$p(\text{END} \mid x)$ is not fixed but rather depends on the previous states visited.
Possibility 3 listed in the next paragraph describes a remedy.
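
The transition probabilities of HC can be tabulated for a toy neighborhood. The sketch below rests on an assumption: with uniform sampling with replacement, HC terminates at $x$ exactly when Pat consecutive samples all fall outside $G(x)$, so $p(\text{END} \mid x) = (1 - |G(x)|/|N(x)|)^{\text{Pat}}$. The function name `hc_transition_probs` and the toy objective are hypothetical.

```python
END = "END"  # sentinel standing in for the special END state


def hc_transition_probs(x, neighbors, obj, patience):
    """Transition distribution of hillclimbing that rejects equi-cost moves.

    Assumes neighbors are sampled uniformly with replacement and that
    neighbors(x) lists each neighbor once.
    """
    nbrs = neighbors(x)
    improving = [g for g in nbrs if obj(g) < obj(x)]
    p_end = (1 - len(improving) / len(nbrs)) ** patience
    probs = {END: p_end}
    for g in improving:
        probs[g] = (1 - p_end) / len(improving)
    return probs


def toy_obj(x):
    return abs(x - 1)


def toy_neighbors(x):
    return [x - 2, x - 1, x + 1, x + 2]


# From x = 0 only the neighbor 1 improves, so one of four samples succeeds.
probs = hc_transition_probs(0, toy_neighbors, toy_obj, patience=2)
print(probs)                 # END: 0.5625, state 1: 0.4375
print(sum(probs.values()))   # the distribution sums to 1
```

The distribution depends only on the current state, which is the Markov property claimed for HC; at a local optimum $G(x)$ is empty and the procedure terminates with probability 1.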
Biased random walks. This general category includes any local search procedure
where state transitions are memoryless: stochastic hillclimbing with or without
equi-cost moves; force-best-move; GSAT and WALKSAT; and random walks in
the state space $X$. These procedures are Markovian over $X$ but, in general,
not proper. To make them proper, a termination condition must be specified.
I consider three possibilities:

Possibility 1: Run the procedure for a fixed number of steps.

This destroys the Markov condition, since the termination probability
$p(\text{END} \mid x)$ depends not just on $x$ but on the global step counter. It is
possible, however, to include this counter as part of the state, as I will
discuss under the heading of Simulated Annealing below.

Possibility 2: Introduce a termination probability $\epsilon > 0$.

If $p(\text{END} \mid x) \ge \epsilon$ for every state $x$, then the procedure remains Markovian
and becomes proper, thus making it suitable for STAGE. However, this
approach may randomly cause termination to occur during a fruitful part
of the search trajectory.

Possibility 3: Use patience-based termination and the best-so-far abstraction.

This approach means cutting off search after Pat consecutive steps have
failed to improve on the best state found so far on the trajectory. This
makes the search procedure proper if $|X|$ is finite, but breaks the Markov
property, since $p(\text{END} \mid x)$ depends on not just $x$ but also the current
patience counter and the best Obj value seen previously. However, we can
use a simple trick to reclaim the Markov property.

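Definition 6. Given a local search procedure $\pi$, the best-so-far abstraction
of this policy is a new policy BSF($\pi$) which filters out all but the best-so-far
states on each trajectory produced by $\pi$. That is, if $\pi$ produces
$\tau = (x_0^\tau, x_1^\tau, x_2^\tau, \ldots)$ with probability $P(\tau \mid x_0^\tau = x_0)$, then with that same
probability, BSF($\pi$) produces

$$\tau' = (x_{i_0}^\tau, x_{i_1}^\tau, x_{i_2}^\tau, \ldots)$$

where $(i_0, i_1, i_2, \ldots)$ is the subsequence consisting of all indices that satisfy

$$\mathrm{Obj}(x_{i_k}^\tau) < \min_{0 \le j < i_k} \mathrm{Obj}(x_j^\tau).$$

For example, given a trajectory of $\pi$ with associated Obj values:

$$
\begin{array}{cccccccccc}
x_0 & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 & \text{END} \\
\downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \downarrow & \\
42 & 44 & 39 & 31 & 35 & 31 & 40 & 28 & 30 &
\end{array}
$$

the procedure BSF($\pi$) would produce the monotonic trajectory

$$
\begin{array}{ccccc}
x_0 & x_2 & x_3 & x_7 & \text{END} \\
\downarrow & \downarrow & \downarrow & \downarrow & \\
42 & 39 & 31 & 28 &
\end{array}
$$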

Given  this  definition,  we  can  prove  the  following: 

Proposition 2. If local search procedure $\pi$ is Markovian over a finite state
space $X$, and $\pi'$ is the procedure that results by adding patience-based
termination to $\pi$, then procedure BSF($\pi'$) is proper, Markovian, and strictly
monotonic.

The proof is given in Appendix A.1. In practical terms, this means we can
run STAGE just as described on page 52, except that in Step 2b, we train
the fitter on only the best-so-far states of each sample trajectory. Compared
with random termination (Possibility 2), patience-based termination
vastly reduces the number of training samples we collect for fitting $\tilde{V}^\pi$, but
gives us both a more natural cutoff criterion and the monotonicity needed
to apply reinforcement-learning methods. In Section 4.7, I compare these
two possibilities empirically on the domain of Boolean satisfiability.
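
The best-so-far filter used in Possibility 3 is simple to state in code. This is a minimal sketch, with the hypothetical name `best_so_far`, that keeps a state only when it strictly improves on every earlier state of the trajectory; it reproduces the worked example given with Definition 6.

```python
def best_so_far(trajectory, obj):
    """Keep only the states that strictly improve on all earlier states."""
    kept, best = [], float("inf")
    for x in trajectory:
        value = obj(x)
        if value < best:
            kept.append(x)
            best = value
    return kept


# Obj values of x_0 .. x_8 from the example; states are named by their index.
values = [42, 44, 39, 31, 35, 31, 40, 28, 30]
kept = best_so_far(range(len(values)), lambda i: values[i])
print(kept)                        # -> [0, 2, 3, 7]
print([values[i] for i in kept])   # -> [42, 39, 31, 28]
```

Note that $x_5$ is dropped even though it ties the best value seen so far: the strict inequality is what makes the filtered trajectory strictly monotonic.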

Simulated annealing. The steps taken during simulated annealing search depend
on not only the current state but also a time-varying temperature parameter
$t_i > 0$. In particular, from state $x_i$ at time $i$, simulated annealing evaluates a
random neighbor $x' \in N(x_i)$ and sets

$$
x_{i+1} = \begin{cases} \text{END} & \text{if } i > \text{TotEvals} \\ x' & \text{if } \text{RAND} < e^{[\mathrm{Obj}(x_i) - \mathrm{Obj}(x')]/t_i} \\ x_i & \text{otherwise} \end{cases}
\tag{3.5}
$$

where RAND is a random variable uniformly chosen from $[0, 1)$. The temperature
$t_i$ decreases over time. At high temperatures, moves that worsen the objective
function even by quite a lot are often accepted, whereas at low temperatures,
worsening moves are usually rejected. Improving moves and equi-cost moves
are always accepted no matter what the temperature.

Because of the dependence on temperature, simulated annealing is not Markovian.
However, any local search procedure can be made Markovian by augmenting
the state space with whatever extra variables are relevant to future
transition probabilities. In the case of simulated annealing, if the temperature
schedule $t_i$ is fixed in advance, it suffices to augment $X$ with a single variable
$i \in \mathbb{N}$, the move counter. For example, if $\pi$ uses the schedule

$$
t_i = \begin{cases} 2.0 - i/1000 & \text{if } i < 1000 \\ 0.99^{(i-1000)} & \text{if } i \ge 1000, \end{cases}
$$

which decays the temperature linearly and then exponentially, then $\pi$ is Markovian
in $X \times \mathbb{N}$. In the expanded space, $V^\pi(x, i)$ predicts the outcome of simulated
annealing when search starts from state $x$ at time $i$.
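
The example schedule and the acceptance rule can be written down directly. Both functions below are a sketch under assumptions: the schedule constants (a linear decay from 2.0 over the first 1000 moves, then a decay by a factor of 0.99 per move) are reconstructed from a partly illegible formula, and the acceptance probability is the standard Metropolis rule, which agrees with the statement that improving and equi-cost moves are always accepted. The names `temperature` and `accept_probability` are hypothetical.

```python
import math


def temperature(i):
    """Assumed schedule: linear decay until move 1000, exponential afterwards."""
    if i < 1000:
        return 2.0 - i / 1000
    return 0.99 ** (i - 1000)


def accept_probability(obj_current, obj_candidate, t):
    """Metropolis acceptance probability for a minimization problem."""
    if obj_candidate <= obj_current:
        return 1.0  # improving and equi-cost moves are always accepted
    return math.exp((obj_current - obj_candidate) / t)


# The same worsening move (Obj rises by 1) early and late in the run.
print(accept_probability(4.0, 5.0, temperature(0)))
print(accept_probability(4.0, 5.0, temperature(1500)))
```

Because the temperature depends only on the move counter $i$, the pair $(x, i)$ determines the transition distribution, which is the sense in which simulated annealing is Markovian in the augmented space.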

However, this formulation is of limited usefulness to STAGE. For any fixed $x$, we
expect the best value of $V^\pi(x, i)$ to occur at $i = 0$, since search should always benefit
from having more time remaining. Thus, in Step 2c, STAGE can fix $i = 0$ while
searching for a good starting point; but then there is little benefit in having
trained on all the simulated annealing trajectory states with $i > 0$. In other
words, it would seem that to apply the basic STAGE algorithm to simulated
annealing, one may as well train on only the actual starting state $x_0$ of each
trajectory, and forego the improved data efficiency that the Markov assumption
usually brings. I test this empirically in Section 5.2.3. Later, in Section 8.2.2, I
also discuss a modified version of STAGE which allows simulated annealing to
exploit $\tilde{V}^\pi$ more fully.


This section has analyzed a variety of hillclimbing and random-walk local search
procedures from which STAGE can learn. From a theoretical point of view, the ideal
procedure should be proper, Markovian, and monotonic. From a practical point of
view, the procedure should also be (1) effective at finding good solutions on its own,
so STAGE begins from a high performance baseline; and (2) predictable, so that $V^\pi$
has learnable structure. In practice, stochastic hillclimbing rejecting equi-cost moves
seems to be a good choice; it is used for the bulk of the results in Chapter 4.
Alternative choices for $\pi$ are explored in Section 4.7 ($\pi$ = WALKSAT) and Section 5.2.3
($\pi$ = simulated annealing).

3.4.2  Choosing  the  Features 

STAGE approximates $V^\pi$ with statistical regression over a real-valued feature
representation of the state space $X$: $\tilde{V}^\pi(F(x)) \approx V^\pi(x)$. Clearly, the quality of the
approximation will depend on the feature representation we choose. As with any
function approximation task, the features are "usually handcrafted, based on whatever
human intelligence, insight, or experience is available, and are meant to capture
the most important aspects of the current state" [Bertsekas and Tsitsiklis 96].

As  discussed  earlier  in  Section  3.1.3,  most  practical  problems  are  awash  in  features 
that  could  help  predict  the  outcome  of  local  search.  In  the  bin-packing  example 
of  Section  3.1.3,  we  listed  half  a  dozen  plausible  features,  such  as  variance  in  bin 
fullness.  For  a  travelling  salesperson  problem,  some  reasonable  features  of  a  tour  x 
might  include 

•  Obj(x) = the sum of the intercity distances in x

•  the  variance  of  the  distances  in  x.  (This  could  identify  whether  tours  with  some 
short  and  some  long  hops  are  more  or  less  promising  than  tours  with  mostly 
medium-length  hops.) 

•  the  number  of  improving  steps  in  the  search  neighborhood  of  x.  (The  general 
usefulness  of  this  feature  as  a  local  search  heuristic  has  been  investigated  by 
[Moll  et  al.  97].) 

•  for each city c, the distance to the next city d assigned by x. (These
fine-grained features nearly specify x completely, but may be too numerous for
efficient learning.)

•  geometric features of the tour (if applicable), such as the average bend angle at
each city or average $\Delta x$ and $\Delta y$ of the distances in x.
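The first two features in the list above can be computed in a single pass over a tour. The following is a minimal sketch, not code from the thesis: the function name `tour_features`, the coordinate representation of cities, and the use of population variance are all assumptions made for illustration.

```python
import math
from statistics import pvariance

def tour_features(cities, tour):
    """Return (Obj, hop-length variance) for a closed tour.

    cities: list of (x, y) coordinates (assumed Euclidean instance).
    tour:   list of city indices in visiting order.
    """
    hops = []
    for i, c in enumerate(tour):
        d = tour[(i + 1) % len(tour)]          # next city assigned by the tour
        (x1, y1), (x2, y2) = cities[c], cities[d]
        hops.append(math.hypot(x2 - x1, y2 - y1))
    return sum(hops), pvariance(hops)

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
perimeter_obj, perimeter_var = tour_features(square, [0, 1, 2, 3])
crossing_obj, crossing_var = tour_features(square, [0, 2, 1, 3])
```

On the unit square, the perimeter tour has four equal hops and therefore zero variance, while the self-crossing tour mixes diagonal and side hops and has positive variance.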

Empirically, STAGE seems to do at least as well as random multi-restart $\pi$ no
matter what features are chosen. Note that if no features are used, $\tilde{V}^\pi$ is always
constant, and STAGE reduces to random multi-restart $\pi$. I generally choose just
a few coarse, simple-to-compute features of a problem space, such as the variance
features mentioned above or subcomponents of the objective function. Using only a
few features minimizes the computational overhead of training and works well in
practice. Chapter 4 gives many more examples of effective feature sets for large-scale
domains, and Section 5.2.1 investigates the empirical effect of using different feature
sets.

3.4.3  Choosing  the  Fitter 

STAGE relies on a function approximator Fit to produce $\tilde{V}^\pi$ from sample training
data. Examples of function approximators include polynomial regression;
memory-based methods such as k-nearest-neighbor and locally weighted regression
[Cleveland and Devlin 88]; neural networks such as multi-layer perceptrons
[Rumelhart et al. 86], radial basis function networks [Moody and Darken 89], and
cascaded architectures [Fahlman and Lebiere 90]; CMACs [Albus 81];
multi-dimensional splines [Friedman 91]; decision trees such as CART [Breiman et
al. 84]; and many others.

What qualities of Fit make it most suitable for STAGE? The most important
requirements are the following:

Incremental. STAGE trains on many states (perhaps on the order of millions)
during the course of an optimization run, so the fitter must be able to handle
large quantities of data without an undue memory or computational burden.
Training occurs once per STAGE iteration and must be efficient. Evaluating
the learned function occurs on every step of hillclimbing on $\tilde{V}^\pi$ and must be
very efficient. In the terminology of [Sutton and Whitehead 93], the fitter must
be strictly incremental.

Noise-tolerant. The training values STAGE collects are the outcomes of long
stochastic search trajectories. Thus, the fitter must be able to tolerate substantial
noise in the training set.

Extrapolating. STAGE hillclimbs on the learned function in search of promising,
previously unvisited states. Thus, STAGE can benefit from a fitter that
extrapolates trends from the training samples. Figure 3.9 contrasts the fits learned
by quadratic regression and 1-nearest-neighbor on a small one-dimensional
training set. Even though the quadratic approximation has worse residual error on
the training samples, it is more useful for STAGE’s hillclimbing. Note that
STAGE’s ObjBound cutoff helps compensate in cases where the fitter
over-extrapolates.




Figure  3.9.  Quadratic  regression  and  1-nearest-neighbor 




Given these requirements, the ideal function approximators for STAGE are those
in the class of linear architectures. Following the development of [Bertsekas and
Tsitsiklis 96], a general linear architecture has the form

$$\tilde{V}(x, \beta) = \sum_{k=1}^{K} \beta[k]\,\phi_k(x) \tag{3.6}$$

where $\beta[1], \beta[2], \ldots, \beta[K]$ are the components of the coefficient vector $\beta$, and the
$\phi_k$ are fixed, easily computable real-valued basis functions. For example, if the
state $x$ ranges over $\mathbb{R}$ and $\phi_k(x) = x^{k-1}$ for each $k = 1, \ldots, K$, then $\tilde{V}$ represents a
polynomial in $x$ of degree $K - 1$. Other examples of linear architectures include
CMACs, random representation networks [Sutton and Whitehead 93], radial basis
function networks with fixed basis centers, and multi-dimensional polynomial
regression. Figure 3.10 illustrates the chain of mappings by which STAGE produces
a prediction from an optimization state x.

The particular linear architecture I prefer for STAGE is quadratic regression.
Quadratic regression produces $K = (D+1)(D+2)/2$ basis functions from the features
$F(x) = f_1, f_2, \ldots, f_D$:

$$\Phi(F(x)) = (1,\ f_1, f_2, \ldots, f_D,\ f_1^2, f_1 f_2, \ldots, f_1 f_D,\ f_2^2, \ldots, f_2 f_D,\ \ldots,\ f_D^2) \tag{3.7}$$

Quadratic  regression  is  flexible  enough  to  capture  global  first-order  and  second-order 
trends  in  the  feature  space  and  to  represent  a  global  optimum  at  any  point  in  feature 
space,  but  also  biased  enough  to  smooth  out  significant  training  set  noise  and  to 
extrapolate  aggressively. 
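The quadratic basis expansion of Equation 3.7 is short enough to write out directly. The sketch below is illustrative only; the function name `quadratic_basis` is hypothetical, and the ordering of the cross terms is an assumption chosen to match the example values shown in Figure 3.10.

```python
def quadratic_basis(features):
    """Expand D features into the K = (D+1)(D+2)/2 terms of a full quadratic."""
    f = list(features)
    basis = [1.0] + f                      # constant term, then linear terms
    for i in range(len(f)):
        for j in range(i, len(f)):         # products f_i * f_j for i <= j
            basis.append(f[i] * f[j])
    return basis

# Two features, as in the Figure 3.10 example: Obj = 14, Var = 0.027.
phi = quadratic_basis([14, 0.027])
```

With the rounded value Var = 0.027 this yields cross terms 0.378 and 0.000729; the slightly different figures printed in Figure 3.10 (0.381 and 0.000741) are presumably computed from an unrounded variance.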


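state: $x \in X$  ↦  features: $F(x) \in \mathbb{R}^D$  ↦  basis vector: $\Phi(F(x)) \in \mathbb{R}^K$  ↦  prediction: $\beta \cdot \Phi(F(x)) \in \mathbb{R}$

Example: (state diagram not recoverable)  ↦  {Obj = 14, Var = 0.027}  ↦  {1, 14, 0.027, 196, 0.381, 0.000741}  ↦  12.4

Figure 3.10. Approximation of $V^\pi$ by a linear architecture (quadratic regression)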


Linear architectures meet the “strictly incremental” requirement: compared with
any of the other function approximators listed at the beginning of this section, training
them is very efficient in time and memory. Let our training set be denoted by

$$\{\phi_{(1)} \mapsto y_1,\ \phi_{(2)} \mapsto y_2,\ \ldots,\ \phi_{(N)} \mapsto y_N\}$$

where $\phi_{(i)} = (\phi_1(F(x_i)), \ldots, \phi_K(F(x_i)))$ is the $i$th training basis vector, and $y_i$ is
the $i$th target output. The goal of training is to find coefficients that minimize the
squared residuals between predictions and target values; that is,

$$\beta^* = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left(y_i - \beta \cdot \phi_{(i)}\right)^2$$

Finding the optimal coefficients $\beta^*$ is a linear least squares problem that can be solved
by efficient linear algebra techniques. The sufficient statistics for $\beta^*$ are the matrix
$A$ (of dimension $K \times K$) and vector $b$ (of length $K$), computed as follows:

$$A = \sum_{i=1}^{N} \phi_{(i)}\,\phi_{(i)}^{\mathsf{T}} \qquad b = \sum_{i=1}^{N} \phi_{(i)}\,y_i \tag{3.8}$$

Given this compact representation of the training set, the coefficients of the linear fit
can be calculated as

$$\beta^* = A^{-1} b \tag{3.9}$$

Singular  Value  Decomposition  is  the  method  of  choice  for  inverting  A,  since  it  is 
robust  when  A  is  singular  [Press  et  al.  92]. 

On each iteration of STAGE, many new training samples are added to the training
set, and then the function approximator is re-trained once. With a linear architecture,
adding new training samples is simply a matter of incrementing the $A$ matrix and $b$
vector of Equation 3.8. For each sample, this incurs a cost of $O(K^2)$ to compute the
outer product $\phi_{(i)}\phi_{(i)}^{\mathsf{T}}$ and $O(K)$ to compute $\phi_{(i)} y_i$. The training samples can then
be discarded. Updated values of $\beta^*$ are computed by Equation 3.9 in $O(K^3)$ time,
independent of the number of samples in the training set.
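The accumulate-then-solve pattern of Equations 3.8 and 3.9 can be sketched in a few lines. This is an illustrative, dependency-free sketch rather than the thesis implementation: the class name `IncrementalLinearFit` is hypothetical, and for brevity it solves the system by Gauss-Jordan elimination with partial pivoting instead of the Singular Value Decomposition recommended in the text, so it assumes $A$ is nonsingular.

```python
class IncrementalLinearFit:
    """Accumulates the sufficient statistics A and b for linear least squares."""

    def __init__(self, k):
        self.k = k
        self.A = [[0.0] * k for _ in range(k)]
        self.b = [0.0] * k

    def add_sample(self, phi, y):
        """O(K^2) update; the sample itself can be discarded afterwards."""
        for r in range(self.k):
            self.b[r] += phi[r] * y
            for c in range(self.k):
                self.A[r][c] += phi[r] * phi[c]

    def solve(self):
        """O(K^3) solve of A beta = b (assumes A is nonsingular)."""
        k = self.k
        M = [row[:] + [rhs] for row, rhs in zip(self.A, self.b)]
        for col in range(k):
            pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
            if abs(M[pivot][col]) < 1e-12:
                raise ValueError("A is singular; an SVD-based solver is needed")
            M[col], M[pivot] = M[pivot], M[col]
            for r in range(k):
                if r != col:
                    factor = M[r][col] / M[col][col]
                    for c in range(col, k + 1):
                        M[r][c] -= factor * M[col][c]
        return [M[r][k] / M[r][r] for r in range(k)]

fit = IncrementalLinearFit(2)
for x in [0.0, 1.0, 2.0, 3.0]:
    fit.add_sample([1.0, x], 3.0 + 2.0 * x)   # noiseless samples of y = 3 + 2x
beta = fit.solve()
```

The memory footprint is fixed at $K^2 + K$ numbers regardless of how many samples have been added, which is the property the text relies on.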

Linear architectures also have the advantage of low memory use. Between
iterations of STAGE, we need store only $A$ and $b$, not the whole training set. Therefore,
the bottleneck on memory usage is the space required to store the state features
along any single trajectory during Step 2a (refer to page 52). This is usually quite
modest, but can become significant for local search procedures which visit tens of
thousands of states on a single trajectory. In such cases, the Least-Squares TD(1)
algorithm can be used. Least-Squares TD(1) produces the same coefficients $\beta^*$ as
Equation 3.9 above with the same amount of computation, but requires no memory
for saving trajectories: it performs its computations fully incrementally as the
trajectory is generated. Least-Squares TD(1) may be applied in STAGE when the baseline
local search procedure $\pi$ is monotonic; the algorithm is described in its general
TD($\lambda$) form in Section 6.1.

3.4.4  Discussion 

This section has considered the theoretical and computational issues that arise in
choosing a local search procedure $\pi$, feature mapping F, and function approximator
Fit for use by STAGE. To summarize the practical conclusions of this section:

1.  A good choice for $\pi$ is stochastic hillclimbing, rejecting equi-cost moves, with
patience-based termination. This procedure is proper, Markovian, monotonic,
and easy to apply in almost any domain.

2.  Features F(x) of a state x must be hand chosen, but are generally abundant.
A good choice is to use a few simple, coarse features such as subcomponents of
Obj(x).

3.  A  good  choice  for  Fit  is  quadratic  regression.  Its  training  time  and  memory 
requirements  are  small  and  independent  of  the  number  of  training  samples. 

STAGE has two further inputs that have not yet been discussed: N and Pat,
the neighborhood structure and patience parameter used for stochastic hillclimbing
on the learned evaluation function $\tilde{V}^\pi$. In general, I simply set these to the same
neighborhood structure and patience parameter which were used to define $\pi$.


STAGE: Empirical Results


The last chapter introduced STAGE, an optimization algorithm that learns to
incorporate extra features of a problem into an evaluation function and thereby
improve overall performance. In this chapter, I first define my methodology for
measuring STAGE’s performance and comparing it to other algorithms empirically. I then
describe my implementations of seven large-scale optimization domains with widely
varying characteristics:

•  Bin-packing  (§4.2):  pack  a  collection  of  items  into  as  few  bins  as  possible — 
the  classic  NP-complete  problem,  as  discussed  in  the  examples  of  Chapter  3; 

•  Channel  routing  (§4.3):  minimize  the  area  needed  to  produce  a  specified 
circuit  in  VLSI; 

•  Bayes  net  structure-finding  (§4.4):  determine  the  optimal  graph  of  data 
dependencies  between  variables  in  a  dataset; 

•  Radiotherapy  treatment  planning  (§4.5):  given  an  anatomical  map  of  a 
patient’s  brain  tumor  and  nearby  sensitive  structures,  plan  a  minimally  harmful 
radiation  treatment; 


•  Cartogram design (§4.6): for geographic visualization purposes, redraw a
map of the United States so that each state’s area is proportional to its
population, minimizing deformations;

•  Satisfiability  (§4.7):  given  a  Boolean  formula,  find  a  variable  assignment  that 
makes  the  formula  true;  and 

•  Boggle  board  setup  (§4.8):  find  a  5  x  5  grid  of  letters  containing  as  many 
English  words  in  connected  paths  as  possible. 

For each of these domains, I apply STAGE as described in Section 3.2, unmodified,
and report statistically significant results.

4.1  Experimental Methodology

The  effectiveness  of  a  heuristic  optimization  algorithm  should  be  measured  along 
three  main  dimensions  [Barr  et  al.  95,  Johnson  96]: 

•  Solution quality. How close are the solutions found by the method to
optimality?

•  Computational effort. How long did the method take to find its best
solution? How quickly does it find good solutions?

•  Robustness.  Does  the  method  work  across  multiple  problem  instances  and 
varying  domains?  In  the  case  of  a  randomized  method,  are  its  results  consistent 
when  applied  repeatedly  to  the  same  problem? 

This chapter evaluates the performance of STAGE in each of these dimensions through
a statistical analysis of the results of thousands of experimental runs. STAGE’s results
are contrasted to the optimal solution (if available), special-purpose algorithms for
each domain (if available), and two general-purpose reference algorithms:
multiple-restart stochastic hillclimbing and simulated annealing.

4.1.1  Reference  Algorithms 

I compare STAGE’s performance to that of multi-restart stochastic hillclimbing and
simulated annealing on every problem instance. Multi-restart stochastic hillclimbing
is not only straightforward to implement, but also a surprisingly effective heuristic
on some domains. For example, hillclimbing with 100 random restarts is generally
adequate for finding high-quality solutions to a geometric line-matching problem
[Beveridge et al. 96]. For the comparative experiments of this chapter, I run
hillclimbing with the following parameters:

•  At  each  step,  moves  are  randomly  chosen  in  a  problem-dependent  way.  The 
first  such  move  that  either  improves  Obj  or  keeps  it  the  same  is  accepted. 

•  The  patience  parameter  is  set  individually  for  each  problem;  generally  the  best 
setting  is  on  the  same  order  of  magnitude  as  the  number  of  available  actions. 
Search  restarts  whenever  patience  consecutive  moves  have  been  evaluated  since 
finding  the  state  whose  objective  function  value  is  best  on  the  current  trajectory. 

•  Search restarts at either a special start state or a random state; this distribution
is also set individually for each domain.

•  The total number of moves evaluated is limited to TotEvals, the same
parameter that governs STAGE’s termination.
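The four parameters above fit together as in the following sketch of multi-restart stochastic hillclimbing. It is a minimal illustration under stated assumptions, not the code used for the experiments: the function and argument names (`multi_restart_hillclimb`, `random_neighbor`, `start_state`) are hypothetical, and the toy integer problem at the end exists only to exercise the loop.

```python
import random

def multi_restart_hillclimb(obj, random_neighbor, start_state,
                            patience, tot_evals, rng):
    """Minimize obj, accepting improving and equi-cost moves.

    A trajectory ends once `patience` consecutive moves have been evaluated
    since its best state was found; the whole search ends after `tot_evals`
    evaluated moves.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    best_overall = None
    evals = 0
    while evals < tot_evals:
        x = start_state(rng)
        fx = obj(x)
        best_on_traj, since_best = fx, 0
        while evals < tot_evals and since_best < patience:
            y = random_neighbor(x, rng)
            fy = obj(y)
            evals += 1
            since_best += 1
            if fy <= fx:                       # improving or equi-cost move
                x, fx = y, fy
                if fx < best_on_traj:
                    best_on_traj, since_best = fx, 0
        if best_overall is None or best_on_traj < best_overall:
            best_overall = best_on_traj
    return best_overall

# Toy problem (hypothetical): minimize (x - 7)^2 over the integers.
result = multi_restart_hillclimb(
    obj=lambda x: (x - 7) ** 2,
    random_neighbor=lambda x, r: x + r.choice((-1, 1)),
    start_state=lambda r: r.randint(-50, 50),
    patience=20, tot_evals=5000, rng=random.Random(0))
```

Rejecting equi-cost moves instead only requires changing the acceptance test from `fy <= fx` to `fy < fx`.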

Conceptually,  simulated  annealing  is  only  slightly  more  complicated:  it  accepts 
all  improving  and  equi-cost  moves,  but  also  probabilistically  accepts  some  worsening 
moves.  In  practice,  though,  implementing  an  effective  temperature  annealing  schedule 
which  regulates  the  evolution  of  the  acceptance  probabilities  can  be  difficult.  Based 
on  a  review  of  the  literature  and  a  good  deal  of  experimentation,  I  implemented 
Swartz’s  modification  of  the  Lam  temperature  schedule  [Swartz  and  Sechen  90,  Lam 
and  Delosme  88] .  This  adaptive  schedule  produces  excellent  results  across  a  variety  of 
domains.  Appendix  B  provides  detailed  justification  for  and  implementation  details 
of  this  “modified  Lam”  schedule. 

4.1.2  How  the  Results  are  Tabulated 

For the purpose of summarizing the solution quality, computational effort, and
robustness of STAGE and competing heuristics, it would be ideal to plot each algorithm’s
average performance versus elapsed time. However, my experiments were run on
a pool of over 100 workstations having widely varying job loads, processor speeds,
and system architectures; collecting meaningful average timing measurements was
therefore impossible. Instead, for each algorithm I plot average performance versus
number of moves considered. In practical applications where evaluating Obj is
relatively costly, this quantity correlates strongly with total running time, yet is
independent of machine speed and load. I do also give sample timing comparisons on each
domain, measured on an SGI Indigo2 R10000 workstation, but since this workstation
is multi-user, these figures should be considered very rough. Each time reported is
the median of three independent runs.

Note that for STAGE, the number of moves considered includes moves made
during both stages of the algorithm, i.e., both running $\pi$ and optimizing $\tilde{V}^\pi$. However,
this number does not capture STAGE’s additional overhead for feature construction
and function approximator training. With linear approximation architectures (see
Section 3.4.3) and simple features, this overhead is minimal: typically, less than 10%
of the execution time.

On any single run of a search algorithm, after n moves have been considered, the
performance is defined as the best Obj value found up to that point. The overall
performance of the algorithm at time n, $Q(n)$, is defined as the expected single-run
performance:

$$Q(n) \stackrel{\text{def}}{=} E\left(\min_{i=0\ldots n} \mathrm{Obj}(x_i)\right)$$

where the expectation is taken over all trajectories $(x_0, x_1, \ldots, x_n)$ that the algorithm
may generate on the problem. For experimental evaluation purposes, I sample $Q(n)$
by taking the mean best value seen over multiple independent runs. The variable n
ranges from 0 to TotEvals. Note that $Q(n)$ is non-increasing regardless of whether
the search algorithm monotonically improves Obj.
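The sampling of $Q(n)$ described above amounts to a running minimum per run followed by a pointwise mean across runs. A minimal sketch, with hypothetical function names and made-up Obj values chosen only for illustration:

```python
from itertools import accumulate

def best_so_far(obj_values):
    """Running minimum of the Obj values visited on one run."""
    return list(accumulate(obj_values, min))

def mean_performance(runs):
    """Estimate Q(n) by averaging the best-so-far curves of equal-length runs."""
    curves = [best_so_far(run) for run in runs]
    return [sum(column) / len(column) for column in zip(*curves)]

# Two short runs (illustrative numbers, not experimental data).
runs = [[9, 7, 8, 5, 6],
        [10, 10, 4, 6, 3]]
q = mean_performance(runs)
```

Because each best-so-far curve is non-increasing, their mean is non-increasing as well, even though the raw Obj values on each run go up and down.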

Figure 4.1 illustrates how a performance curve is produced for an algorithm and
a problem. In this example, a search algorithm is run five times for TotEvals =
500 steps each run. The upper-left graph plots the objective function value of the
states visited during the course of these five runs. The upper-right graph plots the
performance of each run, that is, the best Obj value seen so far at each step. The third
graph, at bottom left, plots the mean performance of the five runs over time. Finally,
the fourth graph summarizes the algorithm’s overall performance in a boxplot. The
boxplot is calculated at the endpoint of the runs, where n = TotEvals. It gives the
95% confidence interval of the mean performance (shown as a box around the mean)^1
and the end result of the best and worst of the runs (shown as “whiskers”).
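The quantities shown in the boxplot can be computed from the end-of-run results alone. The sketch below is illustrative: the function name `boxplot_summary` and the sample values are hypothetical, and it assumes the standard error is the sample standard deviation divided by the square root of the number of runs.

```python
from math import sqrt
from statistics import mean, stdev

def boxplot_summary(final_values):
    """Mean, 2-standard-error interval, and best/worst end results (minimizing)."""
    m = mean(final_values)
    stderr = stdev(final_values) / sqrt(len(final_values))
    return {
        "mean": m,
        "ci_low": m - 2 * stderr,
        "ci_high": m + 2 * stderr,
        "best": min(final_values),     # lower Obj is better
        "worst": max(final_values),
    }

# End-of-run Obj values from five hypothetical runs.
summary = boxplot_summary([104, 105, 105, 106, 103])
```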

For  each  comparative  experiment  in  this  chapter,  the  results  are  presented  in  two 
figures  and  a  table: 

•  a  mean  performance  curve,  like  the  lower  left  graph  of  Figure  4.1,  indicates  how 
quickly  each  algorithm  reached  its  best  performance  level; 

•  a  boxplot,  like  the  lower  right  graph  of  Figure  4.1,  provides  a  useful  visual  means 
for  comparing  the  algorithms  against  one  another  [Barr  et  al.  95];  and 

•  a results table numerically displays the same data as the boxplot. In each table,
the best minimum, best maximum, and statistically best mean performances
are boldfaced. Each table also reports the running time and overall percentage
of moves accepted (relative to TotEvals) for each algorithm, but these figures are
medians of only three runs and should thus be considered rough estimates.

I  now  proceed  to  describe  the  various  optimization  domains,  experiments,  and 
results  by  which  I  have  evaluated  STAGE. 


^1 The 95% confidence interval of the mean is simply 2 standard errors on either side
of the mean.

Figure 4.1. From a number of independent runs of an algorithm (upper left) and
their best-so-far curves (upper right), a single performance curve (lower left) and box
plot (lower right) are generated to summarize the algorithm’s performance. Please
refer to the text for a detailed explanation.

4.2  Bin-packing

Bin-packing, a classical NP-hard optimization problem [Garey and Johnson 79],
has often been used as a testbed for combinatorial optimization algorithms (e.g.,
[Falkenauer 96, Baluja and Davies 97]). The problem was introduced earlier in
Chapter 3. Recall that we are given a bin capacity $C$ and a list $L = (a_1, a_2, \ldots, a_n)$ of items,
each having a size $s(a_i) > 0$. The goal is to pack the items into as few bins as possible,
i.e., partition them into a minimum number $m$ of subsets $B_1, B_2, \ldots, B_m$ such that for
each $B_j$, $\sum_{a_i \in B_j} s(a_i) \le C$.

In this section, I present results on benchmark problem instances from the
Operations Research Library (see Appendix C.1 for details). The first instance considered,
u250_13 [Falkenauer 96], has 250 items of sizes uniformly distributed in (20, 100) to
be packed into bins of capacity 150. The item sizes sum to 15294, so a lower bound
on the number of bins required is $\lceil 15294/150 \rceil = 102$.
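The lower bound is just a ceiling division of total item size by bin capacity. A small sketch (the function name is hypothetical) that also shows how little slack a 102-bin packing would leave:

```python
def bin_count_lower_bound(total_item_size, capacity):
    """Smallest bin count whose combined capacity covers the total item size."""
    return -(-total_item_size // capacity)     # integer ceiling division

lower_bound = bin_count_lower_bound(15294, 150)
slack = lower_bound * 150 - 15294              # unused capacity at the bound
```

Here `lower_bound` is 102 and `slack` is 6, so a packing that meets the bound could waste only 6 units of capacity across all of its bins.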

Falkenauer  reported  excellent  results — on  this  problem,  a  solution  with  only  103 
bins — using  a  specially  modified  search  procedure  termed  the  “Grouping  Genetic 
Algorithm”  and  a  hand-tuned  objective  function: 

We thus settled for the following cost function for the BPP [binpacking
problem]: maximize $f_{BPP} = \frac{\sum_{i=1}^{m} (fill_i / C)^k}{m}$ with m being the number of bins
used, $fill_i$ the sum of sizes of the objects in the bin i, C the bin capacity and
k a constant, k > 1.... The constant k expresses our concentration on the
well-filled “elite” bins in comparison to the less filled ones. Should k = 1,
only the total number of bins used would matter, contrary to the remark
above. The larger k is, the more we prefer the “extremists” as opposed
to a collection of equally filled bins. We have experimented with several
values of k and found out that k = 2 gives good results. Larger values
of k seem to lead to premature convergence of the algorithm, as the local
optima, due to a few well-filled bins, are too hard to escape. [Falkenauer
and Delchambre 92, notation edited for consistency]
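To see how the exponent k rewards “extremist” packings, the quoted cost function can be evaluated on two packings that use the same number of bins. This is an illustrative sketch; the function name `falkenauer_cost` and the example fill levels are not from the source.

```python
def falkenauer_cost(fills, capacity, k=2):
    """Average of (fill_i / C)^k over the bins in use; larger is better."""
    return sum((fill / capacity) ** k for fill in fills) / len(fills)

even = falkenauer_cost([75, 75], 150)       # two half-full bins
skewed = falkenauer_cost([140, 10], 150)    # same total size, unevenly spread
perfect = falkenauer_cost([150, 150], 150)  # every bin exactly full
```

Both example packings use two bins and hold 150 units in total, yet the skewed one scores higher, since its nearly empty bin is one move away from being eliminated.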

STAGE requires neither the complex “group-oriented” genetic operators of
Falkenauer’s encoding, nor any hand-tuning of the cost function. Rather, it uses natural
local-search operators on the space of legal solutions. A solution state x simply
assigns a bin number $b(a_i)$ to each item. Each item is initially placed alone in a bin:
$b(a_1) = 1, b(a_2) = 2, \ldots, b(a_n) = n$. Neighboring states are generated by moving a
single item $a_i$, as follows:

1.  Let B be the set of bins other than $b(a_i)$ that are non-empty but still have
enough spare capacity to accommodate $a_i$;

2.  If $B = \emptyset$, then move $a_i$ to an empty bin;

3.  Otherwise, move $a_i$ to a bin selected randomly from B.

Note that hillclimbing methods always reject moves of type (2), which add a new bin;
and that if equi-cost moves are also rejected, then the only accepted moves will be
those that empty a bin by placing a singleton item in an occupied bin.
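The three-step move generator above might be sketched as follows. This is a hypothetical illustration rather than the thesis code: it assumes a state is a list mapping each item index to a bin label (numbered from 0), that the item to move is chosen uniformly at random, and that any unused label may serve as the empty bin.

```python
import random
from collections import defaultdict

def random_move(bin_of, sizes, capacity, rng):
    """Return a neighboring assignment obtained by moving one random item."""
    fill = defaultdict(int)
    for item, b in enumerate(bin_of):
        fill[b] += sizes[item]
    i = rng.randrange(len(bin_of))             # item to move (assumed uniform)
    # Step 1: non-empty bins other than b(a_i) with room for a_i.
    candidates = [b for b, f in fill.items()
                  if b != bin_of[i] and f + sizes[i] <= capacity]
    if candidates:
        target = rng.choice(candidates)        # Step 3: random bin from B
    else:
        # Step 2: B is empty, so use a bin label that holds no items.
        target = next(b for b in range(len(bin_of) + 1) if b not in fill)
    neighbor = list(bin_of)
    neighbor[i] = target
    return neighbor

rng = random.Random(1)
merged = random_move([0, 1, 2], [60, 60, 60], 150, rng)   # some bin has room
forced = random_move([0, 1], [100, 100], 150, rng)        # no bin has room
```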

The objective function STAGE is given to minimize is simply Obj(x) = m, the
number of bins used. There is no need to tune the evaluation function manually. For
automatic learning of its own secondary evaluation function, STAGE is provided with
two state features, just as in the bin-packing example of section 3.3.2 (p. 54):

•  Feature 1: Obj(x), the number of bins used by solution x

•  Feature 2: Var(x), the variance in bin fullness levels

This second feature provides STAGE with information about the proportion of
“extremist” bins, similar to that provided by Falkenauer’s cost function. STAGE then
learns its evaluation function by quadratic regression over these two features.
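Both features are cheap to compute from a bin assignment. The sketch below is an assumption-laden illustration: the function name is hypothetical, fullness is taken to be fill divided by capacity, and the variance is the population variance over the bins in use; the source does not spell out these details.

```python
from statistics import pvariance

def bin_packing_features(bin_of, sizes, capacity):
    """Return (Obj, Var): bins in use and variance of their fullness levels."""
    fill = {}
    for item, b in enumerate(bin_of):
        fill[b] = fill.get(b, 0) + sizes[item]
    fullness = [f / capacity for f in fill.values()]   # assumed normalization
    return len(fill), pvariance(fullness)

obj_even, var_even = bin_packing_features([0, 0, 1, 1], [75, 75, 75, 75], 150)
obj_skew, var_skew = bin_packing_features([0, 1], [150, 30], 150)
```

The first packing has two equally full bins and zero variance; the second has one full and one nearly empty bin, giving a variance of 0.16.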

The remaining parameters to STAGE are set as follows: the patience parameters
are set to 250 and the ObjBound cutoff is disabled (set to $-\infty$). In a few informal
experiments, varying these parameters had a negligible effect on the results. Table 4.1
lists all of STAGE’s parameter settings.


Parameter   Setting
---------   -------
$\pi$       stochastic hillclimbing, rejecting equi-cost moves, patience=250
ObjBound    $-\infty$
features    2 (number of bins used, variance of bin fullness levels)
fitter      quadratic regression
Pat         250
TotEvals    100,000

Table 4.1. Summary of STAGE parameters for bin-packing results. (For
descriptions of the parameters, see Section 3.2.2.)


STAGE’s performance is contrasted with that of four other algorithms:

•  HC0: multi-restart stochastic hillclimbing with equi-cost moves rejected,
patience=1000. On each restart, search begins at the initial state which has each
item in its own bin.

•  HC1: the same, but with equi-cost moves accepted.

•  SA: simulated annealing, as described in Appendix B.

•  BFR: multi-restart “best-fit-randomized,” a simple bin-packing algorithm with
good worst-case performance bounds [Kenyon 96, Coffman et al. 96]. BFR
begins with all bins empty and a random permutation of the list of items. It then
successively places each item into the fullest bin that can accommodate it, or
a new empty bin if no non-empty bin has room. When all items have been
placed, BFR outputs the number of bins used. The process then repeats with
a new random permutation of the items.
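The BFR procedure in the last bullet is simple enough to sketch in full. The code below follows that description but is not the implementation used in the experiments; the function names and the four-item example are hypothetical.

```python
import random

def best_fit_randomized(sizes, capacity, rng):
    """One BFR pass: best-fit packing of a random permutation; returns bins used."""
    order = list(sizes)
    rng.shuffle(order)
    fills = []                                  # current fill level of each bin
    for s in order:
        feasible = [idx for idx, f in enumerate(fills) if f + s <= capacity]
        if feasible:
            fullest = max(feasible, key=lambda idx: fills[idx])
            fills[fullest] += s                 # fullest bin that still has room
        else:
            fills.append(s)                     # open a new bin
    return len(fills)

def multi_restart_bfr(sizes, capacity, restarts, rng):
    """Best bin count over several random permutations (one 'move' each)."""
    return min(best_fit_randomized(sizes, capacity, rng) for _ in range(restarts))

rng = random.Random(0)
single_pass = best_fit_randomized([100, 50, 100, 50], 150, rng)
best_of_many = multi_restart_bfr([100, 50, 100, 50], 150, 50, rng)
```

On this tiny instance a single pass uses three bins whenever the two small items happen to come first, which is why restarting with fresh permutations helps.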

All algorithms are limited to 100,000 total moves. (For BFR, each random
permutation tried counts as a single move.)

Figure 4.2. Bin-packing performance


The results of 100 runs of each algorithm are summarized in Table 4.2 and
displayed in Figure 4.2. Stochastic hillclimbing rejecting equi-cost moves (HC0) is clearly
the weakest competitor on this problem; as pointed out earlier, it gets stuck at the first
solution in which each bin holds at least two items. With equi-cost moves accepted
(HC1), hillclimbing explores much more effectively and performs almost as well as
simulated annealing. Best-fit-randomized performs even better. However, STAGE,
which builds itself a new evaluation function by learning to predict the behavior of
HC0, the weakest algorithm, significantly outperforms all the others. Its mean solution
quality is under 105 bins, and on one of the 100 runs, it equalled the best solution
(103 bins) found by Falkenauer’s specialized bin-packing algorithm [Falkenauer 96].
The best-so-far curves show that STAGE learns quickly, achieving good performance
after only about 10000 moves, or about 4 iterations on average.

STAGE’s timing overhead for learning was on the order of only 7% over HC0.
In fact, the timing differences between SA, HC1, HC0, and STAGE are attributable
mainly to their different ratios of accepted to rejected moves: rejected moves are
slower in my bin-packing implementation since they must be undone. Since SA
accepts most moves it considers early in search, it finishes slightly more quickly. The
BFR runs took much longer, but the performance curve makes it clear that its best
solutions were reached quickly.^2


                      Performance (100 runs each)             moves
Instance  Algorithm   mean          best  worst     time      accepted
--------  ---------   -----------   ----  -----     -----     --------
u250_13   HC0         119.68±0.17   117   121       12.4s     8%
          HC1         109.38±0.10   108   110       11.4s     71%
          SA          108.19±0.09   107   109       11.1s     44%
          BFR         106.10±0.07   105   107       95.3s     -
          STAGE       104.60±0.11   103   106       13.3s     6%

Table 4.2. Bin-packing results


As a follow-up experiment, I ran HC1, SA, BFR, and STAGE on all 20 bin-packing
problem instances in the u250 class of the OR-Library (see Appendix C.1). All runs
used the same settings shown in Table 4.1. The results, given in Table 4.3, show
that STAGE consistently found the best packings in each case. STAGE’s average
improvements over HC1, SA, and BFR were 5.0±0.3 bins, 3.8±0.3 bins, and 1.6±0.3
bins, respectively.

How  did  STAGE  succeed?  The  STAGE  runs  followed  the  same  pattern  as  the  runs 
on  the  small  example  bin-packing  instance  of  last  chapter  (§3.3.2).  STAGE  learned  a 
secondary  evaluation  function,  that  successfully  traded  off  between  the  original 
objective  and  the  additional  bin-variance  feature  to  identify  promising  start  states. 
A  typical  evaluation  function  learned  by  STAGE  is  plotted  in  Figure  4.3.^  As  in 

^Falkenauer  reported  a  running  time  of  6346  seconds  for  his  genetic  algorithm  to  find  the  global 
optimum  on  this  instance  [Falkenauer  96],  though  this  was  measured  on  an  SGI  R4000  and  our  times 
were  measured  on  an  SGI  RIOOOO. 

The particular $\tilde{V}^\pi$ plotted is a snapshot from iteration #15 of a STAGE run, immediately after
the solution Obj(x) = 103 was found. The learned coefficients are

$\tilde{V}^\pi(\mathrm{Obj}, \mathrm{Var}) = -99.1 + 636\,\mathrm{Var} + 3462\,\mathrm{Var}^2 + 2.64\,\mathrm{Obj} - 9.03\,\mathrm{Obj} \cdot \mathrm{Var} - 0.00642\,\mathrm{Obj}^2$.

Inst.     Alg.    Performance (25 runs)        Inst.     Alg.    Performance (25 runs)
                  mean        best  worst                        mean        best  worst
u250_00   HC1     105.9±0.2   105   107        u250_10   HC1     111.8±0.2   111   112
          SA      104.7±0.2   104   105                  SA      110.4±0.2   110   111
          BFR     102.1±0.1   102   103                  BFR     108.2±0.1   108   109
          STAGE   100.8±0.2   100   102                  STAGE   106.8±0.2   106   108
u250_01   HC1     106.4±0.2   105   107        u250_11   HC1     108.2±0.2   107   109
          SA      105.0±0.2   104   106                  SA      107.0±0.1   106   108
          BFR     103.0±0.1   102   103                  BFR     104.9±0.1   104   105
          STAGE   101.2±0.2   100   103                  STAGE   103.0±0.2   102   105
u250_02   HC1     108.8±0.2   108   109        u250_12   HC1     112.4±0.2   111   113
          SA      107.6±0.3   106   108                  SA      111.2±0.2   110   112
          BFR     105.1±0.1   105   106                  BFR     109.2±0.2   109   110
          STAGE   103.9±0.5   103   109                  STAGE   107.3±0.2   106   108
u250_03   HC1     106.2±0.2   105   107        u250_13   HC1     109.3±0.2   109   110
          SA      105.3±0.2   105   106                  SA      108.2±0.2   108   109
          BFR     103.1±0.1   103   104                  BFR     106.2±0.1   106   107
          STAGE   101.6±0.2   101   102                  STAGE   104.5±0.2   104   105
u250_04   HC1     107.6±0.2   106   108        u250_14   HC1     106.4±0.2   106   107
          SA      106.8±0.2   106   107                  SA      105.2±0.2   105   106
          BFR     104.0±0.1   104   105                  BFR     103.1±0.1   103   104
          STAGE   102.7±0.2   102   103                  STAGE   101.3±0.2   100   102
u250_05   HC1     108.0±0     108   108        u250_15   HC1     112.0±0.2   111   113
          SA      106.8±0.1   106   107                  SA      111.0±0.1   110   112
          BFR     105.0±0.1   105   106                  BFR     109.0±0.1   108   110
          STAGE   103.1±0.2   102   104                  STAGE   107.0±0.1   107   108
u250_06   HC1     108.1±0.2   107   109        u250_16   HC1     103.8±0.2   103   104
          SA      106.8±0.2   106   107                  SA      102.4±0.2   102   103
          BFR     105.0±0.1   104   106                  BFR     100.0±0.1   100   101
          STAGE   102.8±0.2   102   104                  STAGE   98.7±0.2    98    99
u250_07   HC1     110.4±0.2   110   111        u250_17   HC1     106.1±0.1   106   107
          SA      109.0±0.1   109   110                  SA      104.9±0.2   104   106
          BFR     107.0±0.1   107   108                  BFR     103.0±0.1   103   104
          STAGE   105.1±0.1   105   106                  STAGE   101.1±0.2   100   102
u250_08   HC1     111.9±0.1   111   112        u250_18   HC1     107.0±0.2   106   108
          SA      111.1±0.1   111   112                  SA      105.8±0.1   105   106
          BFR     109.1±0.1   109   110                  BFR     103.0±0     103   103
          STAGE   107.4±0.2   106   108                  STAGE   101.9±0.3   101   104
u250_09   HC1     107.7±0.2   107   108        u250_19   HC1     108.4±0.2   108   109
          SA      106.1±0.1   106   107                  SA      107.5±0.2   107   108
          BFR     104.1±0.1   104   105                  BFR     105.1±0.1   105   106
          STAGE   102.5±0.3   101   104                  STAGE   103.8±0.2   103   105

Table 4.3. Bin-packing results on 20 problem instances from the OR-Library

the example instance of the last chapter (Figure 3.7), STAGE learns to direct the search
toward the high-variance states from which hillclimbing is predicted to excel.


[Figure: contour plot of $\tilde{V}^\pi$ (iteration #15)]

Figure 4.3. An evaluation function learned by STAGE on bin-packing instance
u250_13. STAGE learns that the states with higher variance (wider arcs of the contour
plot) are promising start states for hillclimbing.
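
As a rough illustration of what such a learned function encodes, the sketch below evaluates the quadratic whose coefficients are reported for iteration #15 on instance u250_13. The function name `v_tilde` and the sample variance values are illustrative assumptions, not part of the original experiments.

```python
def v_tilde(obj: float, var: float) -> float:
    """Quadratic evaluation function using the coefficients reported for iteration #15.

    obj is the number of bins used; var is the bin-variance feature.
    The return value is the predicted outcome of hillclimbing started from that state.
    """
    return (-99.1 + 636.0 * var + 3462.0 * var ** 2
            + 2.64 * obj - 9.03 * obj * var - 0.00642 * obj ** 2)


if __name__ == "__main__":
    # Hold Obj fixed and vary the variance feature (sample values are illustrative).
    for var in (0.0, 0.01, 0.02, 0.03):
        print(f"Obj=105, Var={var:.2f} -> predicted outcome {v_tilde(105, var):.2f}")
```

With these coefficients and Obj held at 105, the predicted outcome falls as Var rises over this small range, which matches the reading of the contour plot: higher-variance states are predicted to be better starting points.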


4.3  VLSI  Channel  Routing 

The  problem  of  “Manhattan  channel  routing”  is  an  important  subtask  of  VLSI  circuit 
design [Deutsch 76, Yoshimura and Kuh 82, Wong et al. 88, Chao and Harper 96, Wilk
96]. Given two rows of labelled terminals across a gridded rectangular channel, we
must  connect  like-labelled  pins  to  one  another  by  placing  wire  segments  into  vertical 
and  horizontal  tracks  (see  Figure  4.4).  Segments  may  cross  but  not  otherwise  overlap. 
The  objective  is  to  minimize  the  area  of  the  channel’s  rectangular  bounding  box — or 
equivalently,  to  minimize  the  number  of  different  horizontal  tracks  needed. 

Channel routing is known to be NP-complete [Szymanski 85]. Specialized algorithms
based on branch-and-bound or A* search techniques have made exact solutions
attainable for some benchmarks [Lin 91]. However, larger problems still can be
solved only approximately by heuristic techniques, e.g. [Wilk 96]. My implementation
is based on the SACR system [Wong et al. 88, Chapter 4], a simulated annealing
approach. SACR's operator set is sophisticated, involving manipulations to a
partitioning of vertices in an acyclic constraint graph. If the partitioning meets certain
additional constraints, then it corresponds to a legal routing, and the number of
partitions corresponds to the channel size we are trying to minimize.

[Figure: channel with terminal rows labelled 814214395765 and 926435408977]

Figure 4.4. A small channel routing instance, shown with a solution occupying 7
horizontal tracks.

Like  Falkenauer’s  bin-packing  implementation  described  above,  Wong’s  channel 
routing  implementation  required  manual  objective  function  tuning: 

Clearly, the objective function to be minimized is the channel width w(x).
However, w(x) is too crude a measure of the quality of intermediate
solutions. Instead, for any valid partition x, the following cost function is
used:

$C(x) = w(x)^2 + \lambda_p \cdot p(x)^2 + \lambda_u \cdot U(x)$    (4.1)

where p(x) is the longest path length of [a graph induced by the
partitioning], both $\lambda_p$ and $\lambda_u$ are constants, and ...
$U(x) = \sum_{i=1}^{w(x)} u_i(x)^2$,
where $u_i(x)$ is the fraction of track i that is unoccupied. [Wong et al. 88,
notation edited for consistency]

They hand-tuned the coefficients and set $\lambda_p = 0.5$, $\lambda_u = 10$. To apply STAGE
to this problem, I began with not the contrived function C(x) but the natural
objective function Obj(x) = w(x). The additional objective function terms used in
Equation 4.1, p(x) and U(x), along with w(x) itself, were given as the three input
features to STAGE's function approximator. Thus, the features of a solution x are

• Feature 1: w(x) = channel width, i.e. the number of horizontal tracks used by
the solution.

• Feature 2: p(x) = the length of the longest path in a “merged constraint graph”
$G_x$ representing the solution. This feature is a lower bound on the channel width
of all solutions derived from x by merging subnets [Wong et al. 88]. In other
words,  this  feature  bounds  the  quality  of  solution  that  can  be  reached  from  x 
by  repeated  application  of  a  restricted  class  of  operators,  namely,  merging  the 
contents  of  two  tracks  into  one.  The  inherently  predictive  nature  of  this  feature 
suits  STAGE  well. 

• Feature 3: U(x) = the sparseness of the horizontal tracks, measured by

$U(x) = \sum_{i=1}^{w(x)} u_i(x)^2$,

where $u_i(x)$ is the fraction of track i that is unoccupied. Note that this feature
is real-valued, whereas the other two are discrete; and that 0 < U(x) < w(x).
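
The relationship between the three features and the hand-tuned cost of Equation 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions: the squared form of U(x) is taken as the sparseness measure, the longest-path length p(x) is passed in as a number rather than computed from a constraint graph, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def sparseness(unoccupied: Sequence[float]) -> float:
    """U(x): sum over tracks of the squared unoccupied fraction (assumed form)."""
    return sum(u * u for u in unoccupied)


def features(unoccupied: Sequence[float], longest_path: int) -> Tuple[int, int, float]:
    """Return (w(x), p(x), U(x)); one entry of `unoccupied` per horizontal track."""
    return len(unoccupied), longest_path, sparseness(unoccupied)


def wong_cost(unoccupied: Sequence[float], longest_path: int,
              lam_p: float = 0.5, lam_u: float = 10.0) -> float:
    """Hand-tuned cost C(x) = w^2 + lam_p * p^2 + lam_u * U, with the reported constants."""
    w, p, u = features(unoccupied, longest_path)
    return w ** 2 + lam_p * p ** 2 + lam_u * u


if __name__ == "__main__":
    tracks = [0.5, 0.0, 0.5]   # illustrative: three tracks, two of them half empty
    print(features(tracks, longest_path=2))
    print(wong_cost(tracks, longest_path=2))
```

STAGE receives the raw feature triple and the plain objective w(x); the weighting that `wong_cost` fixes by hand is what the fitted regression is left to discover.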


Table  4.4  summarizes  the  remaining  STAGE  parameter  settings. 


Parameter   Setting
π           stochastic hillclimbing, rejecting equi-cost moves, patience=250
ObjBound    -∞
features    3 (w(x), p(x), U(x))
fitter      linear regression
Pat         250
TotEvals    500,000

Table 4.4. Summary of STAGE parameters for channel routing results


STAGE's performance is contrasted with that of five other algorithms:

• HC0: multi-restart stochastic hillclimbing with equi-cost moves rejected,
patience=400. On each restart, search begins at the initial state which has each
subnet on its own track.

• HC1: the same, but with equi-cost moves accepted.

• SAW: simulated annealing, using the hand-tuned objective function of
Equation 4.1 [Wong et al. 88].

• SA: simulated annealing, using the true objective function Obj(x) = w(x).

• HCI: stochastic hillclimbing with equi-cost moves accepted, patience=∞ (i.e.,
no restarting).
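
The hillclimbing baselines differ only in whether equi-cost moves are accepted and in the patience value. The following sketch shows one way such a procedure could be written. It assumes that patience counts consecutive evaluated moves without strict improvement and that sideways moves do not reset that counter; neither detail is spelled out here, and all names are hypothetical.

```python
import random
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")


def hillclimb(start: S, obj: Callable[[S], float],
              propose: Callable[[S, random.Random], S],
              patience: float, accept_equal: bool,
              rng: random.Random, max_evals: int) -> Tuple[S, float, int]:
    """One trajectory; returns (final state, its Obj, evaluations used)."""
    current, current_val = start, obj(start)
    evals, stale = 1, 0
    while stale < patience and evals < max_evals:
        cand = propose(current, rng)
        cand_val = obj(cand)
        evals += 1
        if cand_val < current_val:             # strict improvement resets the counter
            current, current_val, stale = cand, cand_val, 0
        else:
            if accept_equal and cand_val == current_val:
                current = cand                 # sideways move (HC1-style)
            stale += 1                         # assumption: sideways moves still count
    return current, current_val, evals


def multi_restart(start, obj, propose, patience, accept_equal, total_evals, seed=0):
    """Restart from `start` until the budget is spent; return the best (state, Obj)."""
    rng = random.Random(seed)
    best, best_val, used = start, obj(start), 0
    while used < total_evals:
        state, val, n = hillclimb(start, obj, propose, patience, accept_equal,
                                  rng, total_evals - used)
        used += n
        if val < best_val:
            best, best_val = state, val
    return best, best_val


if __name__ == "__main__":
    # Toy problem (illustrative only): minimize |x - 7| over the integers.
    step = lambda x, rng: x + rng.choice((-1, 1))
    print(multi_restart(0, lambda x: abs(x - 7), step,
                        patience=50, accept_equal=False, total_evals=2000))
```

Passing `patience=float("inf")` gives the no-restart variant: a single trajectory then consumes the whole evaluation budget.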

All  algorithms  were  limited  to  500,000  total  moves.  Results  on  YK4,  an  instance 
with  140  vertical  tracks,  are  given  in  Figure  4.5  and  Table  4.5.  By  construction  (see 
Appendix  C.2  for  details),  the  optimal  routing  x*  for  this  instance  occupies  only  10 
horizontal  tracks,  i.e.  Obj(x*)  =  10.  A  12-track  solution  is  depicted  in  Figure  4.6. 


Figure  4.5.  Channel  routing  performance  on  instance  YK4 


None of the local search algorithms successfully finds an optimal 10-track solution.
Experiments HC0 and HC1 show that multi-restart hillclimbing performs terribly
when equi-cost moves are rejected, but significantly better when equi-cost moves
are accepted. Experiment SAW shows that simulated annealing, as used with the
objective function of [Wong et al. 88], does considerably better. Surprisingly, the
annealer of Experiment SA does better still. It seems that the “crude” evaluation
function Obj(x) = w(x) allows a long simulated annealing run to effectively
random-walk along the ridge of all solutions of equal cost, and given enough time it will
fortuitously find a hole in the ridge. In fact, increasing hillclimbing's patience to ∞
(disabling restarts) worked nearly as well.

Instance  Algorithm  Performance (100 runs each)    ≈ time   moves
                     mean         best   worst               accepted
YK4       HC0        41.17±0.20   38     43         212s     8%
          HC1        22.35±0.19   20     24         200s     80%
          SAW        16.49±0.16   14     19         245s     32%
          SA         14.32±0.10   13     15         292s     57%
          HCI        14.69±0.12   13     16         350s     58%
          STAGE      12.42±0.11   11     14         405s     5%

Table 4.5. Channel routing results


Figure 4.6. A 12-track solution found by STAGE on instance YK4

STAGE performs significantly better than all of these. How does STAGE learn
to combine the features w(x), p(x), and U(x) into a new evaluation function that
outperforms simulated annealing? I have investigated this question extensively; the
analysis is reported in Chapter 5. Later, in Section 6.2, I also report the results of
transferring STAGE's learned evaluation function between different channel routing
instances.

The disparity in the running times of the algorithms deserves explanation. The
STAGE runs took about twice as long to complete as the hillclimbing runs, but
this is not due to the overhead for STAGE's learning: linear regression over three
simple features is extremely cheap. Rather, STAGE is slower because when the search
reaches good solutions, the process of generating a legal move candidate becomes
more expensive; STAGE is victimized by its own success. STAGE and HCI are also
slowed relative to SA because they reject many more moves, forcing extra “undo”
operations. In any event, the performance curve of Figure 4.5 indicates that halving
any algorithm's running time would not affect its relative performance ranking.


4.4  Bayes  Network  Learning 

Given a dataset, an important data mining task is to identify the Bayesian network
structure that best models the probability distribution of the data [Mitchell 97,
Heckerman et al. 94, Friedman and Yakhini 96]. The problem amounts to finding the
best-scoring acyclic graph structure on A nodes, where A is the number of attributes
in each data record.

Several  scoring  metrics  are  common  in  the  literature,  including  metrics  based  on 
Bayesian  analysis  [Chickering  et  al.  94]  and  metrics  based  on  Minimum  Description 
Length  (MDL)  [Lam  and  Bacchus  94,  Friedman  97].  I  use  the  MDL  metric,  which 
trades  off  between  maximizing  fit  accuracy  and  minimizing  model  complexity.  The 
objective function decomposes into a sum over the nodes of the network x:

$\mathrm{Obj}(x) = \sum_{j=1}^{A} \left( -\mathrm{Fitness}(x_j) + K \cdot \mathrm{Complexity}(x_j) \right)$    (4.2)


Following Friedman [96], the Fitness term computes a mutual information score at
each node $x_j$ by summing over all possible joint assignments to variable j and its
parents:

$\mathrm{Fitness}(x_j) = \sum_{v_j} \sum_{V_{\mathrm{Par}_j}} N(v_j \wedge V_{\mathrm{Par}_j}) \log \frac{N(v_j \wedge V_{\mathrm{Par}_j})}{N(V_{\mathrm{Par}_j})}$

Here, N(·) refers to the number of records in the database that match the specified
variable assignment. I use the ADtree data structure to make calculating N(·) efficient
[Moore and Lee 98].

The Complexity term simply counts the number of parameters required to store
the conditional probability table at node j:

$\mathrm{Complexity}(x_j) = (\mathrm{Arity}(j) - 1) \prod_{i \in \mathrm{Par}_j} \mathrm{Arity}(i)$

The constant K in Equation 4.2 is set to log(R)/2, where R is the number of records
in the database [Friedman 97].
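
The scoring metric of Equation 4.2 can be sketched directly from its three definitions. The version below counts records with plain dictionaries instead of an ADtree, uses natural logarithms (the base is not stated above, so that is an assumption), and uses hypothetical names throughout.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def mdl_score(records: Sequence[Tuple[int, ...]],
              parents: Dict[int, Tuple[int, ...]],
              arity: Sequence[int]) -> float:
    """Obj(x) = sum_j ( -Fitness(x_j) + K * Complexity(x_j) ), with K = log(R) / 2."""
    k = math.log(len(records)) / 2.0
    total = 0.0
    for j, par in parents.items():
        # N(v_j AND V_Par_j): counts of each joint assignment to node j and its parents
        joint = Counter((r[j], tuple(r[i] for i in par)) for r in records)
        # N(V_Par_j): counts of each assignment to the parents alone
        marginal = Counter(tuple(r[i] for i in par) for r in records)
        fitness = sum(n * math.log(n / marginal[pv]) for (_, pv), n in joint.items())
        complexity = (arity[j] - 1) * math.prod(arity[i] for i in par)
        total += -fitness + k * complexity
    return total


if __name__ == "__main__":
    # Illustrative data: attribute 1 is an exact copy of attribute 0.
    data = [(0, 0), (0, 0), (1, 1), (1, 1)]
    print(mdl_score(data, {0: (), 1: ()}, arity=[2, 2]))      # linkless network
    print(mdl_score(data, {0: (), 1: (0,)}, arity=[2, 2]))    # link 0 -> 1
```

On this toy data the linked structure scores lower (better) than the linkless one: the gain in fit at node 1 outweighs the extra parameter that the link costs.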

No efficient methods are known for finding the acyclic graph structure x which
minimizes Obj(x); indeed, for Bayesian scoring metrics, the problem has been shown
to be NP-hard [Chickering et al. 94], and a similar reduction probably applies for the
MDL metric as well. Thus, multi-restart hillclimbing and simulated annealing are
commonly applied [Heckerman et al. 94, Friedman 97]. My search implementation
works as follows. To ensure that the graph is acyclic, a permutation $x_{i_1}, \ldots, x_{i_A}$
on the A nodes is maintained, and all links in the graph are directed from nodes of
lower index to nodes of higher index. Local search begins from a linkless graph on
the identity permutation. The following move operators then apply:

•  With  probability  0.7,  choose  two  random  nodes  of  the  network  and  add  a  link 
between  them  (if  that  link  isn’t  already  there)  or  delete  the  link  between  them 
(otherwise). 

•  With  probability  0.3,  swap  the  permutation  ordering  of  two  random  nodes  of 
the  network.  Note  that  this  may  cause  multiple  graph  edges  to  be  reversed. 

Obj  can  be  updated  incrementally  after  a  move  by  recomputing  Fitness  and  Complexity 
at  only  the  affected  nodes. 
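
The permutation-based representation and the two move operators can be sketched as below. The data layout (an ordering list plus a set of unordered node pairs, with each link's direction read off the ordering) is one possible encoding chosen for illustration, and the function names are hypothetical.

```python
import random
from typing import Dict, FrozenSet, List, Set, Tuple


def propose_move(order: List[int], links: Set[FrozenSet[int]],
                 rng: random.Random) -> Tuple[List[int], Set[FrozenSet[int]]]:
    """Return a neighbouring (order, links) pair; the inputs are left unmodified."""
    a, b = rng.sample(order, 2)                 # two distinct random nodes
    new_order, new_links = list(order), set(links)
    if rng.random() < 0.7:
        new_links ^= {frozenset((a, b))}        # add the link if absent, else delete it
    else:
        i, j = new_order.index(a), new_order.index(b)
        new_order[i], new_order[j] = new_order[j], new_order[i]   # swap the ordering
    return new_order, new_links


def parents(order: List[int], links: Set[FrozenSet[int]]) -> Dict[int, List[int]]:
    """Orient every link from the earlier node in `order` to the later one."""
    pos = {node: k for k, node in enumerate(order)}
    par: Dict[int, List[int]] = {node: [] for node in order}
    for link in links:
        u, v = sorted(link, key=pos.__getitem__)
        par[v].append(u)
    return par


if __name__ == "__main__":
    links = {frozenset((0, 2))}
    print(parents([0, 1, 2], links))   # node 0 is the parent of node 2
    print(parents([2, 1, 0], links))   # after reordering, the same link is reversed
```

Because every link's direction is derived from the ordering, any state reachable by these moves is acyclic by construction, and a single swap can reverse several links at once.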

For  learning,  STAGE  was  given  the  following  seven  extra  features: 

•  Features  1-2:  mean  and  standard  deviation  of  Fitness  over  all  the  nodes 

•  Features  3-4:  mean  and  standard  deviation  of  Complexity  over  all  the  nodes 

•  Features  5-6:  mean  and  standard  deviation  of  the  number  of  parents  of  each 
node 

• Feature 7: the number of “orphan” nodes

Figure 4.7. The SYNTH125K dataset was generated by this Bayes net (from [Moore
and Lee 98]). All 24 attributes are binary. There are three kinds of nodes. The nodes
marked with triangles are generated with $P(a_i = 0) = 0.8$, $P(a_i = 1) = 0.2$. The
square nodes are deterministic. A square node takes value 1 if the sum of its four
parents is even, else it takes value 0. The circle nodes are probabilistic functions
of their single parent, defined by $P(a_i = 1 \mid \mathrm{Parent} = 0) = 0$ and
$P(a_i = 1 \mid \mathrm{Parent} = 1) = 0.4$. This provides a dataset with fairly sparse values and with many
interdependencies.

Figure 4.8. A network structure learned by a sample run of STAGE from the
SYNTH125K dataset. Its Obj score is 719074. By comparison, the actual network
that was used to generate the data (shown earlier in Figure 4.7) scores 718641. Only
two edges from the generator net are missing from the learned net. The learned net
includes 17 edges not in the generator net (shown as curved arcs).

I applied STAGE to three datasets: MPG, a small dataset consisting of 392 records
of 10 attributes each; ADULT2, a large real-world dataset consisting of 30,162 records
of 15 attributes each; and SYNTH125K, a synthetic dataset consisting of 125,000
records of 24 attributes each. The synthetic dataset was generated by sampling from
the Bayes net depicted in Figure 4.7. A perfect reconstruction of that net would
receive a score of Obj(x) = 718641. For further details of the other datasets, please
see Appendix C.3.

The STAGE parameters shown in Table 4.6 were used in all domains. Figures 4.9-4.11
and Table 4.7 contrast the performance of hillclimbing (HC), simulated annealing
(SA) and STAGE. For reference, the table also gives the score of the “linkless” Bayes
net, corresponding to the simplest model of the data: that all attributes are generated
independently.


Parameter   Setting
π           stochastic hillclimbing, patience=200
ObjBound    0
features    7 (Fitness μ, σ; Complexity μ, σ; #Parents μ, σ; #Orphans)
fitter      quadratic regression
Pat         200
TotEvals    100,000

Table 4.6. Summary of STAGE parameters for Bayes net results


Instance          Algorithm  Performance (100 runs each)        time    moves
                             mean           best      worst             accepted
MPG               HC         3563.4±0.3     3561.3    3567.4    35s     5%
(linkless score   SA         3568.2±0.9     3561.3    3595.5    47s     30%
= 5339.4)         STAGE      3564.1±0.4     3561.3    3569.5    48s     2%
ADULT2            HC         440567±52      439912    441171    239s    6%
(linkless score   SA         440924±134     439551    444094    446s    28%
= 554090)         STAGE      440432±57      439773    441052    351s    6%
SYNTH125K         HC         748201±1714    725364    766325    151s    10%
(linkless score   SA         726882±1405    718904    754002    142s    29%
= 1,594,498)      STAGE      730399±1852    718804    782531    156s    4%

Table 4.7. Bayes net structure-finding results


Figure 4.9. Bayes net performance on instance MPG

Figure 4.10. Bayes net performance on instance ADULT2

Figure 4.11. Bayes net performance on instance SYNTH125K

On SYNTH125K, the largest dataset, simulated annealing and STAGE both
improve significantly over multi-restart hillclimbing, usually attaining a score within 2%
of that of the Bayes net that generated the data, and on some runs coming within
0.04%. A good solution found by STAGE is drawn in Figure 4.8. Simulated
annealing slightly outperforms STAGE on average (however, Section 6.1.5 describes an
extension which improves STAGE's performance on SYNTH125K). On the MPG and
ADULT2 datasets, HC and STAGE performed comparably, while SA did slightly less
well on average. SA's odd-looking performance curves deserve further explanation. It
turns out that they are a side effect of the large scale of the objective function: when
single moves incur large changes in Obj, the adaptive annealing schedule (refer to
Appendix B) takes longer to raise the temperature to a suitably high initial level, which
means that the initial part of SA's trajectory is effectively performing hillclimbing.
During this phase SA can find quite a good solution, especially in the real datasets
(MPG and ADULT2), for which the initial state (the linkless graph) is a good starting
point for hillclimbing. The good early solutions, then, are never bettered until the
temperature decreases late in the schedule; hence the best-so-far curve is flat for most
of each run.

All algorithms require comparable amounts of total run time, except on the
ADULT2 task where SA and STAGE both run slower than HC. On that task, the
difference in run times appears to be caused by the types of graph structures explored
during search; SA and STAGE spend greater effort exploring more complex
networks with more connections, at which the objective function evaluates more slowly.
STAGE's computational overhead for learning is insignificant.

In sum, STAGE's performance on the Bayes net learning task was less dominant
than on the bin-packing and channel routing tasks, but it was still more consistently
best or nearly best than either HC or SA on the three benchmark instances attempted.


4.5  Radiotherapy  Treatment  Planning 

Radiation  therapy  is  a  method  of  treating  tumors  [Censor  et  al.  88].  As  illustrated 
in  Figure  4.12,  a  linear  accelerator  that  produces  a  radioactive  beam  is  mounted  on 
a  rotating  gantry,  and  the  patient  is  placed  so  that  the  tumor  is  at  the  center  of  the 
beam’s  rotation.  Depending  on  the  exact  equipment  being  used,  the  beam  can  be 
shaped in various ways as it rotates around the patient. A radiotherapy treatment
plan  specifies  the  beam’s  shape  and  intensity  at  a  fixed  number  of  source  angles. 

A map of the relevant part of the patient's body, with the tumor and all
important structures labelled, is available. Also known are reasonably good clinical
forward models for calculating, from a treatment plan, the distribution of radiation
that will be delivered to the patient's tissues. The optimization task, then, is the
following “inverse problem”: given the map and the forward model, produce a
treatment plan that meets target radiation doses for the tumor while minimizing damage
to sensitive nearby structures. In current practice, simulated annealing and/or linear
programming are often used for this problem [Webb 91, Webb 94].

Figure 4.12. Radiotherapy treatment planning (from [Censor et al. 88])

Figure 4.13 illustrates a simplified planar instance of the radiotherapy problem.
The  instance  consists  of  an  irregularly  shaped  tumor  and  four  sensitive  structures:  the 
eyes,  the  brainstem,  and  the  rest  of  the  head.  A  treatment  for  this  instance  consists  of 
a  plan  to  turn  the  accelerator  beam  either  on  or  off  at  each  of  100  beam  angles  evenly 
spaced  within  [— ^,  ^].  Given  a  treatment  plan,  the  objective  function  is  calculated 
by  summing  ten  terms:  an  overdose  penalty  and  an  underdose  penalty  for  each  of  the 
five  structures.  For  details  of  the  penalty  terms,  please  refer  to  Appendix  C.4. 

I applied hillclimbing (HC), simulated annealing (SA), and STAGE to this domain.
Objective function evaluations are computationally expensive here, so my experiments
considered only 10,000 moves per run. The features provided to STAGE consisted of
the ten subcomponents of the objective function. STAGE's parameter settings are
given in Table 4.8.

Results of 200 runs of each algorithm are shown in Figure 4.14 and Table 4.9. All
performed comparably, but STAGE's solutions were best on average. Note, however,
that the very best solution over all 600 runs was found by a hillclimbing run. The
objective function computation dominates the running time; STAGE's overhead for
learning is relatively insignificant.

Figure 4.13. Radiotherapy instance 5E


Parameter   Setting
π           stochastic hillclimbing, patience=200
ObjBound    0
features    10 (overdose penalty and underdose penalty for each organ)
fitter      quadratic regression
Pat         200
TotEvals    10,000

Table 4.8. Summary of STAGE parameters for radiotherapy results


Instance  Algorithm  Performance (200 runs each)        ≈ time   moves
                     mean            best     worst              accepted
5E        HC         18.822±0.030    18.003   19.294    550s     5.5%
          SA         18.817±0.043    18.376   19.395    460s     29%
          STAGE      18.721±0.029    18.294   19.155    530s     4.9%

Table 4.9. Radiotherapy results

Figure 4.14. Radiotherapy performance on instance 5E

4.6  Cartogram  Design 

A  “cartogram”  or  “Density  Equalizing  Map  Projection”  is  a  geographic  map  whose 
subarea  boundaries  have  been  deformed  so  that  population  density  is  uniform  over  the 
entire map [Dorling 94, Gusein-Zade and Tikunov 93, Dorling 96]. Such maps can be
useful  for  visualization  of,  say,  geographic  disease  distributions,  because  they  remove 
the  confounding  effect  of  population  density.  I  considered  the  particular  instance  of 
redrawing  the  map  of  the  continental  United  States  such  that  each  state’s  area  is 
proportional  to  its  electoral  vote  for  U.S.  President.  The  goal  of  optimization  is  to 
best  meet  the  new  area  targets  for  each  state  while  minimally  distorting  the  states’ 
shapes  and  borders. 

I represented the map as a collection of 162 points in $\mathbb{R}^2$; each state was defined
as a polygon over a subset of those points. Search begins at the original, undistorted
U.S. map. The search operator consisted of perturbing a random point slightly;
perturbations that would cause two edges to cross were disallowed. The objective
function was defined as

$\mathrm{Obj}(x) = \Delta\mathrm{Area}(x) + \Delta\mathrm{Gape}(x) + \Delta\mathrm{Orient}(x) + \Delta\mathrm{Segfrac}(x)$

where ΔArea(x) penalizes states for missing their new area targets, and the other
three terms penalize states for differing in shape and orientation from the true U.S.
map. For details of these penalty terms, please refer to Appendix C.5. Each of the
feature values can be updated incrementally when a single vertex is moved, so local
search can be applied to optimize Obj(x) quite efficiently.

Figure 4.15. Cartograms of the continental U.S. Each state's target area is
proportional to its electoral vote for U.S. President. The undistorted U.S. map (top left) has
zero penalty for state shapes and orientations but a large penalty for state areas, so
Obj(x) = 525.7. Hillclimbing produces solutions like the one shown at top right, for
which Obj(x) = 0.115. The third cartogram, found by STAGE, has Obj(x) = 0.043.
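
Any area-based penalty on a polygonal map needs the area of each state polygon. The sketch below shows that building block only, using the standard shoelace formula, together with each state's share of the total map area. The actual definition of ΔArea is given in Appendix C.5 and is not reproduced here; the function names and the example polygons are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def polygon_area(vertices: List[Point]) -> float:
    """Unsigned area of a simple polygon listed in boundary order (shoelace formula)."""
    twice_area = 0.0
    n = len(vertices)
    for k in range(n):
        x0, y0 = vertices[k]
        x1, y1 = vertices[(k + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def area_shares(states: Dict[str, List[Point]]) -> Dict[str, float]:
    """Fraction of the total map area occupied by each state polygon."""
    areas = {name: polygon_area(poly) for name, poly in states.items()}
    total = sum(areas.values())
    return {name: area / total for name, area in areas.items()}


if __name__ == "__main__":
    toy_map = {
        "A": [(0, 0), (1, 0), (1, 1), (0, 1)],    # unit square
        "B": [(1, 0), (4, 0), (4, 1), (1, 1)],    # 3-by-1 rectangle sharing an edge with A
    }
    print(area_shares(toy_map))
```

Moving one shared vertex changes only the polygons that contain it, which is why such quantities can be maintained incrementally during local search.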

For STAGE, I represented each configuration by four features, namely the four
subcomponents of Obj. Learning a new evaluation function with quadratic regression
over these features, STAGE produced a significant improvement over hillclimbing, but
was outperformed by simulated annealing. Table 4.10 shows STAGE's parameters,
and Table 4.11 and Figure 4.16 show the results. Later, in Sections 5.2.1-5.2.4, I
report the results of further experiments on the cartogram domain using varying
feature sets and function approximators.


Parameter   Setting
π           stochastic hillclimbing, patience=200
ObjBound    0
features    4 (subcomponents of Obj)
fitter      quadratic regression
Pat         200
TotEvals    1,000,000

Table 4.10. Summary of STAGE parameters for cartogram results


Instance  Algorithm  Performance (100 runs each)      time    moves
                     mean           best    worst             accepted
US49      HC         0.174±0.002    0.152   0.195     190s    14%
          SA         0.037±0.003    0.031   0.170     130s    32%
          STAGE      0.056±0.003    0.038   0.132     172s    7%

Table 4.11. Cartogram results


4.7  Boolean  Satisfiability 

4.7.1  WALKSAT 

Figure 4.16. Cartogram performance on instance US49

Finding a variable assignment that satisfies a large Boolean expression is a fundamental
(indeed, the original) NP-complete problem [Garey and Johnson 79]. In recent years,
surprisingly difficult formulas have been solved by WALKSAT [Selman et al. 96], a
simple local search method. WALKSAT, given a formula expressed in CNF (a
conjunction of disjunctive clauses), conducts a random walk in assignment space that is
biased toward minimizing

Obj(x) = # of clauses unsatisfied by assignment x.

When Obj(x) = 0, all clauses are satisfied and the formula is solved.

WALKSAT searches as follows. On each step, it first selects an unsatisfied clause
at random; it will satisfy that clause by flipping one variable within it. To decide
which one, it first evaluates how much overall improvement to Obj would result from
flipping each variable. If the best such improvement is positive, it greedily flips a
variable that attains that improvement. Otherwise, it flips a variable which worsens
Obj: with probability (1 - noise), a variable which harms Obj the least, and with
probability noise, a variable at random from the clause. The best setting of noise is
problem-dependent [McAllester et al. 97].
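
The step rule just described can be sketched as follows. This is a deliberately simple version that recomputes Obj from scratch for every candidate flip rather than maintaining counts incrementally as a practical solver would. The DIMACS-style signed-integer literal encoding and all function names are assumptions made for the illustration.

```python
import random
from typing import Dict, List

Clause = List[int]   # signed literals: +v means x_v, -v means NOT x_v


def satisfied(clause: Clause, assign: Dict[int, bool]) -> bool:
    return any(assign[abs(lit)] == (lit > 0) for lit in clause)


def num_unsat(clauses: List[Clause], assign: Dict[int, bool]) -> int:
    """Obj(x): the number of clauses unsatisfied by the assignment."""
    return sum(not satisfied(c, assign) for c in clauses)


def walksat_step(clauses: List[Clause], assign: Dict[int, bool],
                 noise: float, rng: random.Random) -> bool:
    """Flip one variable in place; return False if every clause is already satisfied."""
    unsat = [c for c in clauses if not satisfied(c, assign)]
    if not unsat:
        return False
    clause = rng.choice(unsat)                  # a random unsatisfied clause
    variables = [abs(lit) for lit in clause]
    base = len(unsat)

    def change_in_obj(v: int) -> int:
        assign[v] = not assign[v]               # try the flip...
        d = num_unsat(clauses, assign) - base
        assign[v] = not assign[v]               # ...and undo it
        return d

    deltas = {v: change_in_obj(v) for v in variables}
    best = min(variables, key=deltas.__getitem__)
    if deltas[best] < 0 or rng.random() >= noise:
        chosen = best                           # greedy flip, or least-harm flip
    else:
        chosen = rng.choice(variables)          # noise: random variable from the clause
    assign[chosen] = not assign[chosen]
    return True


if __name__ == "__main__":
    formula = [[1, 2], [-1, 2], [1, -2]]        # illustrative; satisfied only by x1 = x2 = True
    x = {1: False, 2: False}
    rng = random.Random(0)
    while walksat_step(formula, x, noise=0.5, rng=rng):
        pass
    print(x, num_unsat(formula, x))
```

Setting `noise=0` makes the rule fully greedy apart from the random choice of clause, since the least-harm variable is then always selected.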

WALKSAT is so effective that it has rendered nearly obsolete an archive of
several hundred benchmark problems collected for a 1993 DIMACS Challenge on
satisfiability [Selman et al. 96]. Within that archive, only the “32-bit parity
function learning” instances (nefariously constructed by Crawford, Kearns, Schapire, and
Hirsh [Crawford 93]) are known to be solvable in principle, yet not solvable by
WALKSAT. The development of an algorithm to solve these 32-bit parity instances has been
listed as one of ten outstanding challenges for research in propositional reasoning and
search [Selman et al. 97]. Most of my experiments in the satisfiability domain have
focused on the first such instance in the archive, par32-1.cnf, a formula consisting of
10277 clauses on 3176 variables. After presenting extensive results on this instance,
I will report additional results on the four other benchmark instances in the par32
family.

4.7.2  Experimental  Setup 

WALKSAT is generally run in a random multi-restart regime: after every cutoff
search steps, where cutoff is another parameter of the algorithm, it resets the search to
a random new assignment. Can STAGE, by observing WALKSAT trajectories, learn
a smarter restart policy? Certainly, a variety of additional state features potentially
useful for STAGE's learning are readily available in this domain. I used the following
set of five simple features:

• proportion of clauses currently unsatisfied (∝ Obj(x))

• proportion of clauses satisfied by exactly 1 variable

• proportion of clauses satisfied by exactly 2 variables

• proportion of variables that would break a clause if flipped

• proportion of variables set to their “naive” setting^

All of these features can be computed incrementally in O(1) time after any variable
is flipped.
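
For concreteness, the five features can be computed from scratch as in the sketch below. This is not the incremental O(1) update mentioned above; it simply recomputes everything for a given assignment. It assumes a signed-integer literal encoding, assumes no clause repeats a literal, and treats a variable whose positive and negative occurrence counts tie as not being at its naive setting (the tie case is an assumption). All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def sat_features(clauses: List[List[int]],
                 assign: Dict[int, bool]) -> Tuple[float, float, float, float, float]:
    """Return the five proportions for the given assignment."""
    n_clauses, n_vars = len(clauses), len(assign)
    # number of true literals in each clause
    true_counts = [sum(assign[abs(l)] == (l > 0) for l in c) for c in clauses]
    # a variable "breaks" a clause if it supplies that clause's only true literal
    breakers = set()
    for clause, k in zip(clauses, true_counts):
        if k == 1:
            breakers.update(abs(l) for l in clause if assign[abs(l)] == (l > 0))
    # naive setting: True if x_v occurs in more clauses than NOT x_v, False if fewer
    pos = Counter(l for c in clauses for l in c if l > 0)
    neg = Counter(-l for c in clauses for l in c if l < 0)
    naive = sum(1 for v, val in assign.items()
                if (pos[v] > neg[v] and val) or (neg[v] > pos[v] and not val))
    return (true_counts.count(0) / n_clauses,
            true_counts.count(1) / n_clauses,
            true_counts.count(2) / n_clauses,
            len(breakers) / n_vars,
            naive / n_vars)


if __name__ == "__main__":
    formula = [[1, 2], [-1, 2], [1, -2]]         # illustrative three-clause formula
    print(sat_features(formula, {1: True, 2: True}))
```

The first feature is Obj(x) divided by the number of clauses, so the learned function always has access to the quantity WALKSAT itself is minimizing.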

For comparison, I evaluated the performance of the following algorithms, allowing
each a total of 10^8 bit flips per run:

• (HC): random multi-start hillclimbing, patience = 10^4, accepting equi-cost
moves. Candidate moves are generated using the same bit-flip distribution
as WALKSAT, but moves that increase the number of unsatisfied clauses are
rejected.

• (S/HC): STAGE applied to π = HC. Quadratic regression is used to predict
hillclimbing outcomes from the five features above. For STAGE's second phase
(stochastic hillclimbing on $\tilde{V}^\pi$), the patience is set to 1000, and candidate bits
to flip are chosen uniformly at random, not with the WALKSAT distribution.

• (W): WALKSAT, noise = 0, cutoff = 10^6. These parameter settings were
hand-tuned for best performance over the range noise ∈ {0, 0.05, 0.1, 0.15, ..., 0.5},
cutoff ∈ {10^3, 10^4, 10^5, 10^6, 10^7, 10^8}. Note that the chosen cutoff level of 10^6
means that WALKSAT performs exactly 100 random restarts per run.

• (S/W): STAGE applied to π = WALKSAT. This combination raises some
technical issues involving WALKSAT's termination criterion, as I explain below.

Given a CNF formula F, the naive setting of variable $x_i$ is defined to be 0 if $\neg x_i$ appears in
more clauses of F than $x_i$, or 1 if $x_i$ appears in more clauses than $\neg x_i$.

Theoretically, as discussed in Section 3.4.1, STAGE can learn from any procedure
π that is proper (guaranteed to terminate) and Markovian. WALKSAT's normal
termination mechanism, cutting off after a pre-specified number of steps, is not
Markovian: it depends on an extraneous counter variable, not just the current assignment.
To apply STAGE to π = WALKSAT, I tried three modified termination criteria:

• (S/W1): use STAGE's normal patience-based mechanism for cutting off each
WALKSAT trajectory, just as I do for hillclimbing. Since WALKSAT is
non-monotonic, this mechanism also violates the Markov property: the probability
of cutting off at state x depends not just on x but also on an extraneous counter
variable and the best Obj value seen previously. However, this is easily corrected
by the following adjustment.

• (S/W2): use patience-based cutoffs, but train the function approximator on only
the best-so-far states of each sample WALKSAT trajectory. By Proposition 2
(presented on page 62, proven in Appendix A.1), this subsequence of states
constitutes a sample trajectory from a higher-level search procedure which is
proper, strictly monotonic, and Markovian, so Ṽ^π is well-defined.

• (S/W3): cut off WALKSAT's trajectory with a fixed probability ε > 0 after
every flip. This approach results in a proper Markovian trajectory, so Ṽ^π is
again well-defined. A possible drawback to this approach is that termination
may randomly occur during a fruitful part of the search trajectory.
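The three criteria above can be illustrated on a recorded sequence of Obj values. The sketch below is illustrative only: the function names are hypothetical, and the patience rule is assumed to mean "stop after `patience` consecutive steps that fail to improve on the best value seen so far", which the text does not spell out here.

```python
import random
from typing import List, Sequence


def patience_cutoff_index(obj_values: Sequence[float], patience: int) -> int:
    """Index at which a patience-based cutoff stops the trajectory (S/W1, S/W2).

    Assumed rule: stop once `patience` consecutive steps have failed to
    improve on the best Obj value seen so far.
    """
    best, since_improvement = float("inf"), 0
    for i, value in enumerate(obj_values):
        if value < best:
            best, since_improvement = value, 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                return i
    return len(obj_values) - 1


def best_so_far_indices(obj_values: Sequence[float]) -> List[int]:
    """Indices of the strictly improving states: the S/W2 training subsequence."""
    kept, best = [], float("inf")
    for i, value in enumerate(obj_values):
        if value < best:
            kept.append(i)
            best = value
    return kept


def geometric_cutoff_length(epsilon: float, rng: random.Random) -> int:
    """Number of flips made before an S/W3-style cutoff fires.

    After every flip the trajectory is cut off with probability epsilon,
    independently of everything except the fact that a flip just occurred.
    """
    flips = 1
    while rng.random() >= epsilon:
        flips += 1
    return flips
```

Note that `best_so_far_indices` yields a strictly decreasing sequence of Obj values by construction, which is the monotonicity that the S/W2 argument relies on, whereas the cutoff in `geometric_cutoff_length` needs no counter or memory of past states.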

For (S/W1), I hand-tuned WALKSAT's noise and patience parameters over the same
ranges I used for tuning Experiment (W). Here, I found best performance at noise =
0.25 (more random actions) and patience = 10^4 (more frequent restarting). Without
further tuning, I set patience = 10^4 in (S/W2), ε = 10^-4 in (S/W3), and noise = 0.25
in both. The parameters for STAGE's second phase, hillclimbing on Ṽ^π, were set as
in Experiment (S/HC) above.

Serendipitously, I discovered that introducing an additional WALKSAT parameter
could improve STAGE's performance. The new parameter, call it δ_w, has the
following effect: any flip that would worsen Obj by more than δ_w is rejected. Normal
WALKSAT has δ_w = ∞. Hillclimbing, as done in Experiment (HC) above, is
equivalent to WALKSAT with δ_w = 0, which performs badly. However, using intermediate
settings of δ_w, thereby prohibiting only the most destructive of WALKSAT's
moves, seems not to harm WALKSAT's performance, and in some cases improves it.
For the (S/W) runs reported here, I set δ_w = 10. STAGE's parameter settings for
both Experiments (S/HC) and (S/W) are summarized in Table 4.12.

Parameter   Setting
---------   -------
π           (S/HC): stochastic hillclimbing, patience=10^4; OR
            (S/W): WALKSAT, noise=0.25, patience=10^4, δ_w=10
ObjBound    0
features    5 (Obj, % clauses with 1 true, % clauses with 2 true,
            % clause-breaking variables, % naive variables)
fitter      quadratic regression
Pat         1000
TotEvals    100,000,000

Table 4.12. Summary of STAGE parameters for satisfiability results
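The rejection rule governed by this extra WALKSAT parameter (written δ_w here) reduces to a one-line test. The sketch below uses a hypothetical function name and assumes Obj is the count of unsatisfied clauses, so that a positive difference means the flip makes things worse.

```python
def accept_flip(obj_before: int, obj_after: int,
                delta_w: float = float("inf")) -> bool:
    """Accept a proposed flip unless it worsens Obj by more than delta_w.

    delta_w = infinity reproduces plain WALKSAT (every flip accepted);
    delta_w = 0 rejects every worsening flip but keeps equi-cost ones,
    which matches hillclimbing that accepts equi-cost moves.
    """
    return (obj_after - obj_before) <= delta_w
```

With an intermediate setting such as `delta_w=10`, a flip that raises the number of unsatisfied clauses from 5 to 15 is still allowed, while one that raises it from 5 to 16 is not.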

4.7.3  Main  Results 

The main results are shown in Figure 4.17 and in the top six lines of Table 4.13.
Experiment (HC) performs quite poorly, leaving about 50 clauses unsatisfied on each
run; STAGE's learning improves on this significantly (S/HC), to about 20 unsatisfied
clauses. WALKSAT does better still, leaving 15 unsatisfied clauses on average.
However, all the STAGE/WALKSAT runs do significantly better, leaving only about
5 clauses unsatisfied on average, and as few as 1 or 2 of the formula's 10277 clauses
unsatisfied on the best runs. Although STAGE did not manage to find an assignment
with 0 clauses unsatisfied, these are currently the best published results for this
benchmark [Kautz 98].

I also repeated the STAGE/WALKSAT experiments using linear regression, rather
than quadratic regression, as the function approximator for Ṽ^π. With only 6 coefficients
being fit instead of 21, STAGE still produced about the same level of improvement
over plain WALKSAT (see Table 4.13). Indeed, the fixed-probability termination
criterion experiment (S/W3) performed significantly better under this simpler
regression model.
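The coefficient counts quoted above follow from counting monomials: a polynomial of total degree at most d in n features has C(n + d, d) coefficients, including the constant term. The short check below uses a hypothetical helper name and assumes the quadratic model includes all cross terms.

```python
from math import comb


def num_poly_coefficients(num_features: int, degree: int) -> int:
    """Number of monomials of total degree <= degree in num_features variables."""
    return comb(num_features + degree, degree)


linear_terms = num_poly_coefficients(5, 1)     # constant + 5 linear terms
quadratic_terms = num_poly_coefficients(5, 2)  # adds 5 squares and 10 cross terms
```

For the five satisfiability features this gives 6 coefficients for the linear model and 21 for the full quadratic model.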

Instance      Algorithm                        Performance (N runs each)
                                               mean          best   worst
-----------   ------------------------------   -----------   ----   -----
par32-1.cnf   HC                               48.03±0.59    40     54
(N = 100)     S/HC (STAGE, π = hillclimbing)   21.64±0.77    11     31
              W (WALKSAT)                      15.22±0.35     9     19
              S/W1 (STAGE, π = WALKSAT)         5.36±0.33     1      9
              S/W2                              5.04±0.27     2      8
              S/W3                              5.60±0.29     2      9
              S/W1+linear                       5.27±0.37     2     14
              S/W2+linear                       6.21±0.30     2     10
              S/W3+linear                       4.43±0.28     2      8
par32-2.cnf   W                                15.40±0.57    13     18
(N = 25)      S/W3+linear                       4.32±0.48     2      7
par32-3.cnf   W                                15.84±0.45    14     18
(N = 25)      S/W3+linear                       4.32±0.53     2      7
par32-4.cnf   W                                15.28±0.46    13     17
(N = 25)      S/W3+linear                       4.63±0.60     2      7
par32-5.cnf   W                                15.48±0.61    11     18
(N = 25)      S/W3+linear                       4.76±0.63     2      9

Table 4.13. Satisfiability results on the 32-bit parity benchmarks


Figure 4.17. Satisfiability: main results on instance par32-1.cnf

Table 4.14 shows the approximate running time required by each algorithm on
instance par32-1.cnf. In this domain, STAGE takes about twice as long to complete
as WALKSAT; for that matter, so does hillclimbing (HC). By profiling the executions,
I identified three sources of the disparity. First, WALKSAT saves time by accepting
all of the 10^8 proposed bit flips. Every time HC or STAGE rejects a move, it must
unflip the modified bit and re-update the counts of violated clauses, so rejecting a
move takes twice as long as accepting a move. Second, for all the STAGE runs other
than S/W2, significant time is spent in memory management for storing WALKSAT
trajectories. These trajectories often consist of tens of thousands of states, and my
implementation is not optimized to handle such long trajectories efficiently. Since
the S/W2 runs train on only the best-so-far states, they have much less data to store
and do not pay this penalty. Finally, the function approximation (computing the
coefficients of Ṽ^π by least-squares regression) adds about 3% additional overhead to
STAGE's running time.

Table 4.14 also reveals that the S/W3+linear run accepts fully 97% of its moves.
This simply indicates that it is spending the bulk of its effort in the WALKSAT stage
of search (during which 100% of moves are accepted), and relatively little time on the
stage of hillclimbing on Ṽ^π. A closer look shows that this happens because the linear
approximation is quite inaccurate, with an RMS error of 9.5 on its training set, and
STAGE's ObjBound parameter cuts off hillclimbing as soon as the predicted Ṽ^π(x)
falls below zero, typically after only a few hundred moves on each iteration. Though
these runs reject fewer moves than their S/W3+quadratic counterparts, their extra
WALKSAT runs mean extra memory management for trajectory storage, so their
running time is not significantly lessened.

Despite STAGE's overhead in this domain, it is clear from the plot of Figure 4.17
that even if running times were equalized by halving STAGE's allotted number of
moves, STAGE's performance would still significantly exceed pure WALKSAT's.

4.7.4  Follow-up  Experiments 

I conducted three additional follow-up experiments. First, I compared pure WALKSAT
with STAGE/WALKSAT on the other four 32-bit parity instances from the
DIMACS archive. The results, shown in Table 4.13, corroborate the results on par32-1:
STAGE consistently leaves about 2/3 fewer unsatisfied clauses than WALKSAT on these
problems.
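The "2/3 fewer" figure can be checked directly against the mean values in Table 4.13. The snippet below is only an arithmetic check; the dictionary layout and names are illustrative, and the WALKSAT mean for par32-2 is read as 15.40 from a partly garbled table entry.

```python
# Mean unsatisfied clauses per run, copied from Table 4.13 as
# (WALKSAT, STAGE with S/W3 + linear regression).
MEANS = {
    "par32-1": (15.22, 4.43),
    "par32-2": (15.40, 4.32),  # WALKSAT value read from a garbled entry
    "par32-3": (15.84, 4.32),
    "par32-4": (15.28, 4.63),
    "par32-5": (15.48, 4.76),
}


def relative_reduction(walksat_mean: float, stage_mean: float) -> float:
    """Fraction by which STAGE reduces the mean number of unsatisfied clauses."""
    return 1.0 - stage_mean / walksat_mean


REDUCTIONS = {name: relative_reduction(w, s) for name, (w, s) in MEANS.items()}

if __name__ == "__main__":
    for name, reduction in REDUCTIONS.items():
        print(f"{name}: {reduction:.1%} fewer unsatisfied clauses")
```

On all five instances the reduction falls between roughly 69% and 73%, slightly better than two thirds.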

Algorithm                        ≈ time   moves      extra time   extra time
                                          accepted   to undo      for memory
                                                     moves        management
------------------------------   ------   --------   ----------   ----------
HC                               3200s     61%       ⏱
S/HC (STAGE, π = hillclimbing)   5000s     38%       ⏱⏱           ⏱
W (WALKSAT)                      1900s    100%
S/W1 (STAGE, π = WALKSAT)        4100s     69%       ⏱            ⏱
S/W2                             3200s     54%       ⏱
S/W3                             3900s     70%       ⏱            ⏱
S/W1+linear                      4100s     55%       ⏱            ⏱
S/W2+linear                      3100s     55%       ⏱
S/W3+linear                      3600s     97%                    ⏱⏱

Table 4.14. Approximate running times on instance par32-1.cnf. The stopwatch
icons indicate which aspects of search caused running time to exceed WALKSAT's.

Second, I studied each of the WALKSAT parameter differences between Experiments
(W) and (S/W3) in isolation. Specifically, I ran eight head-to-head experimental
comparisons of WALKSAT and STAGE/WALKSAT with the WALKSAT
parameters fixed to be the same in both algorithms. The eight experiments
corresponded to all combinations of the following settings:

(cutoff, noise, δ_w) ∈ {10^6, 10^4} × {0, 0.25} × {∞, 10}

The results were as follows. In two of the eight comparisons, namely, where cutoff =
10^6 and noise = 0, WALKSAT and STAGE performed statistically equivalently;
STAGE's adaptive restarting performed neither better nor worse than WALKSAT's
random restarting. In the other six comparisons, however, STAGE improved
performance dramatically over plain WALKSAT.

Third, in an attempt to get STAGE to satisfy all the clauses of formula par32-1.cnf,
I ran eight additional runs of (S/W3+linear) with an extended limit of TotEvals =
2 × 10^9 bit flips. These runs used about 24 hours of computation each. The result
was that one run left 1 clause unsatisfied, five runs left 2 clauses unsatisfied, and two
runs left 3 clauses unsatisfied. None solved the formula. Future work will pursue
this further: I hope a more insightful set of features, a less hasty job of parameter-tuning,
or longer runs will enable STAGE to cross the finish line. In any event,
STAGE certainly shows promise for hard satisfiability problems, perhaps especially
for MAXSAT problems where near-miss solutions are useful [Jiang et al. 95].

4.8  Boggle  Board  Setup 

In the game of Boggle, 25 cubes with letters printed on each face are shaken into
a 5 × 5 grid (see Figure 4.18).* The object of the game is to find English words
that are spelled out by connected paths through the grid. A legal path may include
horizontal, vertical, and/or diagonal steps; it may not include any cube more than
once. Long words are more valuable than short ones: the scoring system counts 1
point for 4-letter words, 2 points for 5-letter words, 3 points for 6-letter words, 5
points for 7-letter words, and 11 points for words of length 8 or greater.

* Boggle is published by Parker Brothers, Inc. The 25-cube version is known as “Big Boggle” or
“Boggle Master.”
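The scoring rule just described translates directly into a small lookup. In this sketch the function name is hypothetical, and words shorter than four letters are assumed to score nothing, since the text assigns them no points.

```python
def word_points(word: str) -> int:
    """Points for one word under the 5 x 5 Boggle scoring described above."""
    length = len(word)
    if length < 4:
        return 0  # assumption: words under 4 letters do not score
    # 4, 5, 6 and 7 letters score 1, 2, 3 and 5; anything longer scores 11.
    return {4: 1, 5: 2, 6: 3, 7: 5}.get(length, 11)
```

For example, each of the 11-letter words mentioned in the caption of Figure 4.18 would contribute 11 points.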


Figure 4.18. A random Boggle board (8 words, score=10, Obj=−0.010) and an
optimized Boggle board (2034 words, score=9245, Obj=−9.245). The latter includes
such high-scoring words as depreciated, distracting, specialties, delicateness and
desperateness.


Given a fixed board setup x, finding all the English words in it is a simple computational
task; by representing the dictionary† as a prefix tree, Score(x) can be computed
in about a millisecond. It is a difficult optimization task, however, to identify what
fixed board x* has the highest score. For consistency with the other domains of this
chapter, I pose the problem as a minimization task, where Obj(x) = −Score(x)/1000.
(I used the scaling factor of 1/1000 to avoid a possible repeat of the flat simulated
annealing performance curves found and explained in Section 4.4.) Note that I allow
any letter to appear in any position of x, rather than constraining them to the faces
of real 6-sided Boggle cubes. Exhaustive search of 26^25 Boggle boards is intractable,
so local search is a natural approach.
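A prefix-tree evaluation of Score(x) can be sketched as a depth-first search that abandons any path whose letters are not a prefix of some dictionary word. The code below is an illustrative reconstruction, not the implementation used for these experiments: the nested-dictionary trie, the function names, the rule that each distinct word counts once, and the zero score for words under four letters are all assumptions.

```python
from typing import Dict, Iterable, List, Set

END = "$"  # hypothetical marker key meaning "a dictionary word ends here"


def build_trie(words: Iterable[str]) -> Dict:
    """Build a prefix tree as nested dictionaries keyed by letter."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def word_points(word: str) -> int:
    """Scoring rule from the text; words under 4 letters assumed to score 0."""
    if len(word) < 4:
        return 0
    return {4: 1, 5: 2, 6: 3, 7: 5}.get(len(word), 11)


def find_words(board: List[str], trie: Dict) -> Set[str]:
    """All dictionary words spelled by paths of adjacent, non-repeated squares."""
    size = len(board)
    found: Set[str] = set()

    def dfs(r: int, c: int, node: Dict, path: str, used: Set) -> None:
        ch = board[r][c]
        if ch not in node:
            return  # no dictionary word starts with this prefix: prune
        node = node[ch]
        path += ch
        if END in node:
            found.add(path)
        used.add((r, c))
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if ((dr or dc) and 0 <= nr < size and 0 <= nc < size
                        and (nr, nc) not in used):
                    dfs(nr, nc, node, path, used)
        used.remove((r, c))

    for r in range(size):
        for c in range(size):
            dfs(r, c, trie, "", set())
    return found


def objective(board: List[str], trie: Dict) -> float:
    """Obj(x) = -Score(x)/1000, counting each distinct word once (assumption)."""
    return -sum(word_points(w) for w in find_words(board, trie)) / 1000
```

The pruning step is what keeps evaluation fast: most letter sequences on a board stop being a prefix of any dictionary word after three or four letters, so the search tree stays small.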

I set up the search space as follows. The initial state is constructed by choosing
25 letters uniformly at random. Then, to generate a neighboring state, either of the
following operators is applied with probability 0.5:

• Select a grid square at random and choose a new letter for it. (The new letter
is selected with probability equal to its unigram frequency in the dictionary.)

• Or, select a grid square at random, and swap the letter at that position with
the letter at a random adjacent position.

† My experiments make use of the 126,468-word Official Scrabble Player's Dictionary.
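The two move operators can be sketched as a single neighbor-generating function. The names below are hypothetical, the board is represented as a list of lists of single letters, and "adjacent" is assumed to include diagonal neighbors, as it does for legal word paths.

```python
import random
from typing import Dict, List

Board = List[List[str]]


def random_neighbor(board: Board, letter_freq: Dict[str, float],
                    rng: random.Random) -> Board:
    """Return a copy of the board with one randomly chosen move applied."""
    size = len(board)
    new_board = [row[:] for row in board]  # leave the input board untouched
    r, c = rng.randrange(size), rng.randrange(size)
    if rng.random() < 0.5:
        # Operator 1: redraw one square's letter from the unigram distribution.
        letters = list(letter_freq)
        weights = [letter_freq[letter] for letter in letters]
        new_board[r][c] = rng.choices(letters, weights=weights, k=1)[0]
    else:
        # Operator 2: swap the square with a random adjacent square
        # (assumption: diagonal neighbors count as adjacent).
        neighbors = [(r + dr, c + dc)
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr or dc) and 0 <= r + dr < size and 0 <= c + dc < size]
        nr, nc = rng.choice(neighbors)
        new_board[r][c], new_board[nr][nc] = new_board[nr][nc], new_board[r][c]
    return new_board
```

The first operator changes the multiset of letters on the board, while the second only rearranges it, so together they let the search adjust both which letters are present and where they sit.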

The following features of each state x were provided for STAGE's learning:

1. The objective function, Obj(x) = −Score(x)/1000.

2. The number of vowels on board x.

3. The number of distinct letters on board x.

4. The sum of the unigram frequencies of the letters of x. (These frequencies, computed
directly from the dictionary, range from Freq(e) = 0.1034 to Freq(5) =
0.0016.)

5. The sum of the bigram frequencies of all adjacent pairs of x.


These features are cheap to compute incrementally after each move in state space,
and intuitively should be helpful for STAGE in learning to distinguish promising from
unpromising boards.
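For concreteness, the five board features can be computed from scratch as follows. This is a sketch under stated assumptions rather than the incremental computation mentioned above: the function name is hypothetical, vowels are taken to be a, e, i, o and u, and each adjacent pair of squares is assumed to contribute its bigram frequency in both reading directions.

```python
from typing import Dict, List


def boggle_features(board: List[List[str]], obj: float,
                    unigram: Dict[str, float],
                    bigram: Dict[str, float]) -> List[float]:
    """Return [Obj, #vowels, #distinct letters, unigram sum, bigram sum]."""
    size = len(board)
    letters = [ch for row in board for ch in row]

    bigram_sum = 0.0
    for r in range(size):
        for c in range(size):
            # Visit each unordered adjacent pair exactly once.
            for dr, dc in ((0, 1), (1, -1), (1, 0), (1, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < size and 0 <= nc < size:
                    pair = board[r][c] + board[nr][nc]
                    # Assumption: count the pair in both reading directions.
                    bigram_sum += bigram.get(pair, 0.0) + bigram.get(pair[::-1], 0.0)

    return [
        obj,
        sum(1 for ch in letters if ch in "aeiou"),
        len(set(letters)),
        sum(unigram.get(ch, 0.0) for ch in letters),
        bigram_sum,
    ]
```

Because a single move changes at most two squares, each of these quantities could be updated by touching only those squares and their neighbors, which is what makes the incremental version cheap.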


Parameter   Setting
---------   -------
π           stochastic hillclimbing, patience=1000
ObjBound    −∞
features    5 (Obj, # vowels, # distinct, Σ unigram, Σ bigram)
fitter      quadratic regression
Pat         200
TotEvals    100,000

Table 4.15. Summary of STAGE parameters for Boggle results


However, STAGE's results on Boggle were disappointing. STAGE's parameters
are shown in Table 4.15 and comparative results are shown in Figure 4.19 and
Table 4.16. Average runs of hillclimbing (patience=1000), simulated annealing, and
STAGE all reach the same Boggle score, about 8400-8500 points.

Boggle is the only domain I have tried on which STAGE's learned smart restarting
does not improve significantly over random-restart hillclimbing. This provides an
interesting opportunity to compare the properties of a domain that is poor for STAGE
with the other domains; I do so in Section 5.1.4.


Instance   Algorithm   Performance (100 runs each)                       moves
                       mean            best     worst    time            accepted
--------   ---------   -------------   ------   ------   -----           --------
5x5        HC          −8.413±0.066    −9.046   −7.473   2235s            2.4%
           SA          −8.431±0.086    −9.272   −7.622   1720s           33%
           STAGE       −8.480±0.077    −9.355   −7.570   2450s            1.3%

Table 4.16. Boggle results. The mean performances do not differ significantly from
one another.


Figure 4.19. Performance on the Boggle domain (Boggle score, in thousands, against
number of moves considered, for HC, SA and STAGE)

4.9 Discussion

This chapter has demonstrated that STAGE may be profitably applied to a wide
variety of global optimization problems. In addition to the domains reported here, an
algorithm closely related to STAGE has also been successfully applied to the
“Dial-a-Ride” problem, a variant of the Travelling Salesperson Problem, by Moll et al. [97].
Empirically, in both discrete domains (e.g., channel routing, Bayes net
structure-finding, satisfiability) and continuous domains (e.g., cartogram design), STAGE is
able to learn from and improve upon the results of local search trajectories. In the
next chapter, I will probe more deeply into the reasons for STAGE's success.

In the preceding chapter, I gave empirical evidence that STAGE is an effective
optimization technique for a wide variety of domains. In this chapter, I break down
the causes for that effectiveness. Does STAGE's power derive from its use of machine
learning, per its design? Or are incidental aspects of the way it organizes its search just
as responsible for STAGE's success? How robust is the algorithm to varying choices
of feature sets, function approximators, and other user-controllable parameters?

This chapter reports the results of a series of experiments that explore these
questions empirically. Most of the experiments are performed in the domains of VLSI
channel routing, as described in Section 4.3, and cartogram design, as described in
Section 4.6.

5.1 Explaining STAGE's Success

STAGE performs superbly on the channel routing domain, not only outperforming
hillclimbing as it was trained to do, but also finding better solutions on average than
the best simulated annealing runs. How can we explain its success? There are at
least three possibilities:

Hypothesis A: STAGE works according to its design. Gathering data from a number
of hillclimbing trajectories, it learns to predict the outcome of hillclimbing
starting from various state features, and it exploits these predictions to reach
improved local optima.

Hypothesis B: Since STAGE alternates between simple hillclimbing and another
policy, it simply benefits from having more random exploration. STAGE uses
hillclimbing on the learned Ṽ^π as its secondary policy, but alternative policies
would do just as well.

Hypothesis C: The function approximator may simply be smoothing the objective
function, which helps eliminate local minima and plateaus.

In this section, I first describe experiments which reject the latter two hypotheses.
I then describe experiments giving further evidence for Hypothesis A, that STAGE
does indeed work as designed. Finally, I analyze an instance where STAGE failed.

5.1.1 Ṽ^π versus Other Secondary Policies

Hypothesis B attributes STAGE's success to its unusual regime of alternating between
hillclimbing and a secondary search policy. Perhaps any secondary policy which
perturbed the search away from its current local optimum would be just as effective
as STAGE's secondary policy of hillclimbing on the learned evaluation function Ṽ^π.

To test this, I ran experiments with several alternative choices for STAGE's
secondary policy:

Policy B0 (normal STAGE): Hillclimb on Ṽ^π.

Policy B1: Perform a random walk of fixed length ω. (I tried setting ω to 1, 3, 10,
and 40.)

Policy B2: Hillclimb on the inverse of Ṽ^π. In other words, move stochastically to
a state which Ṽ^π predicts to be the worst place from which to begin a search.

Policy B3: Hillclimb on a corrupted version of Ṽ^π, trained by replacing every target
value in its training set with a random value.

Each of these policies was alternated with standard hillclimbing (patience=250), so
as to imitate STAGE's normal regime. Experiments B0, B2 and B3 approximated Ṽ^π
using linear regression over the three channel routing features (w, p, U), each scaled to
the range [0,1]. I ran each resulting algorithm 50 times on channel routing instance
YK4, limiting each run to TotEvals = 10^5 total moves considered. (Note that the
results of Section 4.3 were tabulated over longer runs of length TotEvals = 5 · 10^5.)


Instance   Algorithm      Performance (50 runs each)
                          mean          best   worst
--------   ------------   -----------   ----   -----
YK4        B0 (STAGE)     13.54±1.13    12     41
           B1 (ω = 1)     16.94±0.23    15     19
           B1 (ω = 3)     17.16±0.24    16     19
           B1 (ω = 10)    19.12±0.27    17     21
           B1 (ω = 40)    22.64±0.26    20     25
           B2             37.86±3.52    14     52
           B3             15.38±0.64    13     27

Table 5.1. Results of different secondary policies on channel routing instance YK4.
In each case, TotEvals = 10^5.


The results, given in Figure 5.1 and Table 5.1, show conclusively that the choice of
secondary policy does matter. STAGE's policy significantly outperformed all others,
with 48 of its 50 runs producing a solution circuit of quality between 12 and 14.
(One outlier run did no better than 41; I analyze this in Section 5.1.3.) The random
walk policies (B1) were consistently inferior. Indeed, the pattern of the B1 results
shows clearly that performance degraded more and more as additional undirected
exploration was allowed. Experiment B2, in which the search was purposely led
against Ṽ^π toward states judged unpromising, performed much worse than even the
random-walk policies. This provides further evidence that Ṽ^π has learned useful
predictive information about the features.

Figure 5.1. Performance of different secondary policies on channel routing instance
YK4

Finally, Experiment B3 performed better than the random-walk policies, but still
significantly worse than STAGE. Why did assigning random target values to the
training set for Ṽ^π result in even moderately good performance? The answer is that
in this experiment, the linear regression model has only four coefficients to fit, and
one of these (the coefficient of the constant term) has no effect when Ṽ^π is used as
an evaluation function to guide search. Thus, even choosing random values for the 3
meaningful coefficients would sometimes lead search in the same useful direction as
the true Ṽ^π. This analysis is supported by the plot of Figure 5.2. This plot compares
the successive local minima visited by a typical run of STAGE and a typical run of
Experiment B3. Clearly, STAGE learns a policy which keeps it in a high-quality part
of the space during most of its search, while B3 only occasionally “lucks into” a good
solution. Still, these results highlight the potential for an alternative algorithm to
STAGE, which works by optimizing the coefficients of a Ṽ^π-like function directly (see
Section 8.2.3).

Figure 5.2. Quality of local minima reached on each successive iteration of a STAGE
run and a B3 run. Gaps in the STAGE plot correspond to iterations on which a restart
occurred.
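The remark that the constant term has no effect on search can be seen in a few lines: hillclimbing only ever compares evaluations of candidate states, and adding the same constant to every evaluation leaves each comparison unchanged. The sketch below uses hypothetical names and made-up feature vectors purely for illustration.

```python
from typing import Sequence, Tuple

Features = Tuple[float, ...]


def best_candidate(candidates: Sequence[Features], coeffs: Sequence[float],
                   intercept: float) -> Features:
    """Pick the candidate whose linear evaluation is lowest."""
    def evaluate(features: Features) -> float:
        return intercept + sum(b * f for b, f in zip(coeffs, features))
    return min(candidates, key=evaluate)


# Illustrative (made-up) scaled feature vectors for three neighboring states.
CANDIDATES = [(0.2, 0.5, 0.9), (0.4, 0.1, 0.3), (0.9, 0.9, 0.1)]
COEFFS = (1.0, -2.0, 0.5)

choice_a = best_candidate(CANDIDATES, COEFFS, intercept=0.0)
choice_b = best_candidate(CANDIDATES, COEFFS, intercept=1000.0)
```

Both calls select the same candidate, so a linear model over three features effectively has only three coefficients that can steer the search.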

The results of this series of experiments do not prove that hillclimbing on Ṽ^π
is the only policy which can productively lead search away from a local minimum
and into the attracting basin of a potentially better local minimum. However, they
do at least demonstrate that not every secondary policy will be effective, and that
STAGE's learned policy is more effective than most.

5.1.2 Ṽ^π versus Simple Smoothing

STAGE performs “predictive smoothing” of the original objective function Obj: the
function Ṽ^π is a continuous mapping from the features of state x to the Obj value
that policy π is predicted to eventually reach from x. But perhaps, as suggested
by Hypothesis C above, the predictive aspect of STAGE's smoothing is irrelevant.
Perhaps STAGE's success could be replicated by simply smoothing Obj directly over
the feature space, which would eliminate the original objective function's local minima
and plateaus.

One flaw of this hypothesis is immediately apparent: in channel routing as in most
of the other domains described in Chapter 4, the objective function value Obj(x) is
provided to STAGE as one of the features of the input feature space. Therefore, any
function approximator could fit Obj perfectly over the feature space by simply copying
the Obj(x) feature from input to output. Clearly, this perfect “smoothing” of Obj
would not eliminate any local minima from Obj; STAGE would reduce to standard
multi-restart hillclimbing. Experiment C0, described below, empirically demonstrates
precisely this effect.

However, perhaps an imperfect smoothing of Obj would produce the hypothesized
good outcome. I performed the following series of experiments. For all experiments
except HC, the basic STAGE regime is unchanged; only the training set's target
values for the function approximator are modified.

Experiment HC: Multi-restart hillclimbing, accepting equi-cost moves, patience =
500.

Experiment STAGE (same as Experiment B0): Normal STAGE, modelling Ṽ^π(x)
by linear regression over the three features (w(x), p(x), U(x)).

Experiment C0 (perfect “smoothing”): For each state in STAGE's training set,
represented by the features (w(x), p(x), U(x)), train the function approximator
to model not Ṽ^π(x) but the objective function Obj(x) = w(x). Here, the
function approximator can simply copy w(x) from input to output, so no real
smoothing occurs. Results are shown in Table 5.2 and Figure 5.3: as expected,
this policy performed similarly to multi-start hillclimbing.

Experiment C1: I eliminated the w feature from the training set, so the function
approximator had to model Obj(x) as a linear function of only (p(x), U(x)).
The results were extremely poor. A closer look explained why. Recall that
Wong defined

\[ U(x) = \sum_{i=1}^{w(x)} u_i(x)^2 \]

where u_i(x) is the fraction of track i that is unoccupied [Wong et al. 88]. In
poor-quality solutions such as those visited at the beginning of a search, u_i(x)
is close to 1 for each track, so U(x) ≈ Obj(x). Thus, even a linear function
approximator can model Obj quite accurately by simply copying U(x) from
input to output. But this smoothed objective function, U(x), makes a terrible
evaluation function for optimization. Starting from a local optimum of Obj,
where no single move can reduce w(x) further, the only way a greedy search can
reduce U is to reduce the variance of the u_i(x) values, that is, to distribute the
wires evenly throughout the available tracks. That is exactly the opposite of the
successful strategy that STAGE discovers: to distribute the wires as unevenly as
possible, thereby creating some near-empty tracks which hillclimbing can more
readily eliminate.

Experiment C2: Using the insight gained from the last experiment, I replaced U(x)
by a closely related feature:

\[ V(x) = \sum_{i=1}^{w(x)} \bigl(1 - u_i(x)\bigr)^2 \]

With a bit of algebra, it can be shown that V(x) = U(x) − w(x) + C, where C
is a constant. With w(x) subtracted from the feature, the regression model can
no longer fit Obj well by simply copying a feature through. Indeed, after 10^5
steps of STAGE, the linearly smoothed fit of Obj over (p(x), V(x)) has an RMS
error around 13, compared with an RMS error of less than 0.1 in Experiment
C1. Thus, this fit does perform significant smoothing. The smoothed fit assigns
a negative coefficient to V(x), so searching on it tends to increase the variance
of the u_i, resulting in improved overall performance relative to both C0 and
C1. Nevertheless, C2's performance is still much worse than STAGE's. (In
Section 5.2.1 below, I show that STAGE's performance with the p and V features
is even better than STAGE with w, p and U.)
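The "bit of algebra" relating V to U can be spelled out as follows. This is a sketch under two assumptions that the text does not state explicitly: that V(x) has the squared form \(\sum_i (1 - u_i(x))^2\), which is the form consistent with the stated identity, and that the total occupied fraction of track, summed over all tracks, is a constant K that does not depend on x because the set of wires to be routed is fixed.

```latex
\begin{align*}
\sum_{i=1}^{w(x)} \bigl(1 - u_i(x)\bigr) &= K
  \quad\Longrightarrow\quad
  \sum_{i=1}^{w(x)} u_i(x) = w(x) - K, \\[4pt]
V(x) &= \sum_{i=1}^{w(x)} \bigl(1 - u_i(x)\bigr)^2 \\
     &= w(x) - 2 \sum_{i=1}^{w(x)} u_i(x) + \sum_{i=1}^{w(x)} u_i(x)^2 \\
     &= w(x) - 2\bigl(w(x) - K\bigr) + U(x) \\
     &= U(x) - w(x) + 2K .
\end{align*}
```

Under these assumptions the constant in the identity is C = 2K. The same expansion shows why U tracks the variance of the u_i values when w(x) is held fixed: their sum is then fixed, so their sum of squares can only fall by making them more nearly equal.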


Instance   Algorithm   Performance (50 runs each)    RMS
                       mean          best   worst    of fit
--------   ---------   -----------   ----   -----    ------
YK4        HC          21.32±0.22    19     23       -
           STAGE       13.54±1.13    12     41       2.1
           C0          21.06±0.74    15     27       0.0
           C1          40.00±0.18    38     41       0.05
           C2          19.34±1.00    15     28       13.6

Table 5.2. Performance comparison of STAGE with STAGE-like algorithms that
simply smooth Obj. For each run, TotEvals = 10^5. The RMS column gives a
typical root-mean-square error of the learned evaluation function over its training set
at the end of the run.


Figure 5.3. Performance on the experiments of Section 5.1.2 (horizontal axis: number
of moves considered)

In addition to the results plotted above, I also repeated experiments C1 and
C2 using quadratic rather than linear regression to smooth Obj. This did decrease
the RMS error of the approximations, to approximately 0.04 for C1 and 3.9 for C2.
However, it did not improve optimization performance in either case. From this series
of experiments, I conclude that, at least for this problem, STAGE's success cannot
be attributed to simple smoothing of the objective function. STAGE's predictive
smoothing works significantly better.

5.1.3  Learning  Curves  for  Channel  Routing 

I have presented evidence that STAGE’s secondary policy, searching on $\tilde{V}^\pi$, is more
helpful to optimization than other reasonable policies based on randomness or smoothing.
This evidence contradicts Hypotheses B and C outlined on page 107. I now
examine STAGE’s success on channel routing more closely, to support Hypothesis A:
that STAGE’s leverage is indeed due to machine learning.

Figure 5.4 illustrates three short runs of STAGE on routing instance YK4. In each
row of the figure, the left-hand graph illustrates the successive local minima visited
over the course of STAGE’s run: a diamond symbol (◇) marks a local minimum of
Obj, and a plus symbol (+) marks a local minimum of the learned evaluation function
$\tilde{V}^\pi$. There is a gap in the plot each time STAGE resets search to the poor initial
state. Observe that the first two runs, (A) and (B), perform well, both attaining a
best solution quality of 12; whereas run (C) performs uncommonly poorly, with a
best solution quality of 34.

The right-hand graphs track the evolution of $\tilde{V}^\pi$ over the course of each run,
plotting the coefficients $\beta_w$, $\beta_p$, and $\beta_U$ on each iteration, where

\[ \tilde{V}^\pi(x) = \beta_w \cdot w(x) + \beta_p \cdot p(x) + \beta_U \cdot U(x) + \beta_1 \]

For clarity, the constant coefficient $\beta_1$ is omitted from the plots: its value, typically
around -80, has no effect on search since it leaves the relative ordering of states
unchanged.
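
The ordering argument can be checked with a minimal sketch. The feature vectors, the placeholder coefficients, and the helper names `v_tilde` and `rank_states` below are hypothetical and chosen only for illustration; they are not data or code from the experiments. The sketch ranks three states under a linear evaluation function with and without its constant term.

```python
def v_tilde(features, coeffs, const):
    """Linear evaluation function: dot(coeffs, features) + const."""
    return sum(c * f for c, f in zip(coeffs, features)) + const


def rank_states(states, coeffs, const):
    """Order state names from lowest (most promising) to highest predicted value."""
    return sorted(states, key=lambda name: v_tilde(states[name], coeffs, const))


# Hypothetical (w, p, U) feature vectors for three states; not data from the thesis.
states = {"a": (20, 15, 12.0), "b": (18, 16, 9.5), "c": (25, 14, 19.0)}
coeffs = (10.0, 0.5, -10.0)  # placeholder coefficients on w, p and U

ranking_with_const = rank_states(states, coeffs, const=-80.0)
ranking_without_const = rank_states(states, coeffs, const=0.0)
print(ranking_with_const == ranking_without_const)
```

Because the constant shifts every predicted value by the same amount, a search that only compares states is unaffected by it.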

The coefficient plots show that all runs converge rather quickly to similar $\tilde{V}^\pi$ functions.
The learned functions have a high positive coefficient on w(x), a negative
coefficient of nearly equal magnitude on U(x), and a coefficient near zero on p(x).
The assignment of a negative coefficient to U is surprising, because U measures the
sparseness of the horizontal tracks. U correlates strongly positively with the objective
function to be minimized; a term of -U in the evaluation function ought to pull
search toward terrible solutions in which each subnet occupies its own track. Indeed,
the hand-tuned evaluation function built by Wong et al. for this problem assigned
U a coefficient of +10 [Wong et al. 88].

However, the positive coefficient on w cancels out this bias. Recall from Experiment
C2 of the previous section that the following relation holds:

\[ V(x) = U(x) - w(x) + C \]

where V(x) is a feature measuring the variance in track fullness levels, and C is a
constant. By assigning $\beta_w \approx -\beta_U$, STAGE builds the evaluation function

\begin{align*}
\tilde{V}^\pi(x) &\approx -\beta_U \cdot w(x) + \beta_p \cdot p(x) + \beta_U \cdot U(x) + \beta_1 \\
                 &= \beta_U \bigl(U(x) - w(x)\bigr) + \beta_p \cdot p(x) + \beta_1 \\
                 &= \beta_U \cdot V(x) + \beta_p \cdot p(x) + \beta_1'
\end{align*}

where $\beta_1' = \beta_1 - \beta_U C$. Thus, in order to predict search performance, STAGE
learns to extract the variance feature V(x). The coefficient $\beta_U$ is negative, so
minimizing STAGE’s learned evaluation function biases search toward increasing V(x),
that is, toward creating solutions with an uneven distribution of track fullness levels.
Although this characteristic is not itself the mark of a high-quality solution, it does
help lead hillclimbing search to high-quality solutions.
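
As a numerical check of this algebra, the sketch below compares the three-feature form with the reduced form that uses only V and p. The constant C, the coefficient values, and the sample feature vectors are hypothetical placeholders, and the function names are introduced here only for illustration; the one relation taken from the text is V(x) = U(x) - w(x) + C.

```python
def v_tilde_wpu(w, p, U, beta_w, beta_p, beta_U, beta_1):
    """Evaluation function over the original features (w, p, U)."""
    return beta_w * w + beta_p * p + beta_U * U + beta_1


def v_tilde_pv(p, V, beta_p, beta_U, beta_1_prime):
    """Equivalent evaluation function over (p, V) once beta_w = -beta_U."""
    return beta_U * V + beta_p * p + beta_1_prime


C = 3.0                                   # hypothetical value of the constant C
beta_p, beta_U, beta_1 = 0.5, -10.0, -80.0  # placeholder coefficients
beta_w = -beta_U                          # the cancellation discussed in the text
beta_1_prime = beta_1 - beta_U * C        # constant absorbs the -beta_U * C term

# Hypothetical (w, p, U) triples; not measurements from any run.
samples = [(20, 15, 12.0), (18, 16, 9.5), (25, 14, 19.0)]
max_gap = 0.0
for w, p, U in samples:
    V = U - w + C
    lhs = v_tilde_wpu(w, p, U, beta_w, beta_p, beta_U, beta_1)
    rhs = v_tilde_pv(p, V, beta_p, beta_U, beta_1_prime)
    max_gap = max(max_gap, abs(lhs - rhs))
print(max_gap)
```

The two forms agree on every sample, which is all the derivation claims: when the coefficients on w and U are equal and opposite, the fit depends on w and U only through V.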

Given that all three STAGE runs of Figure 5.4 learned quickly to extract the
variance feature U - w, why are their performance curves so different? A close
look reveals that the third feature, p(x), matters: performance breaks through to
the “excellent” level when the coefficient $\beta_p$ becomes positive. In run (A), which is
most typical, $\beta_p$ is positive throughout the entire run, and STAGE reaches excellent
solutions on every iteration. In run (B), $\beta_p$ becomes consistently positive only around
move 60000, which corresponds to STAGE’s breakthrough. Finally, in the outlier
run (C), $\beta_p$ remains negative throughout the first 100,000 moves. However, it is slowly
increasing, and in fact, if the same run is allowed to extend to TotEvals = 500,000,
then $\beta_p$ soon becomes positive and performance improves (see Figure 5.5). This run
reaches a solution quality of 12 after 145,000 moves.

Figure 5.4. Three runs of STAGE on channel routing instance YK4. Left-hand
graphs plot the Obj value reached at the end of each search trajectory within STAGE;
right-hand graphs plot the evolution of the coefficients of $\tilde{V}^\pi(x)$ over the run.
(Horizontal axes: number of moves considered, 0 to 100,000.)

Figure 5.5. Run (C) of Figure 5.4, extended to TotEvals = 500,000.

These runs illustrate that STAGE’s convergence rate may depend on the particular
regions that STAGE happens to explore. In run (C), STAGE apparently becomes
“stuck” in a region of poor local optima until the $\beta_p$ coefficient becomes positive,
triggering a random restart from which the learned $\tilde{V}^\pi$ can be effectively exploited.
This suggests that faster convergence may be attained by more frequent random
restarts; I investigate this in Section 5.2.4.

Despite the varying convergence rates, all the STAGE runs on instance YK4 do
eventually converge to an approximation of $V^\pi$ close to the following:

\[ \tilde{V}^\pi(x) \approx 10 \cdot w(x) + 0.5 \cdot p(x) - 10 \cdot U(x) - 80 \qquad (5.1) \]

The coefficients on all three features are significant. As discussed above, the component
10(w(x) - U(x)) biases search toward solutions with an uneven distribution of
track fullness levels. As for p(x), recall from Section 4.3 that it provides a lower bound
on all solutions derived from x by future merging of tracks. Thus, the coefficient of
+0.5 on p(x) helps keep the search from straying onto unpromising trajectories. In
Section 6.2 I show that STAGE learns similar coefficients on a number of other instances
of the channel routing problem.
5.1.4 STAGE’s Failure on Boggle Setup

STAGE  improved  significantly  over  multi-restart  hillclimbing  on  six  of  the  seven  large- 
scale  optimization  domains  presented  in  Chapter  4.  On  the  “Boggle  setup”  domain 
of  Section  4.8,  however,  STAGE  failed  to  produce  a  significant  improvement.  What 
explains  this  failure? 

According to Hypothesis A, STAGE succeeds when it can identify and exploit
trends in the mapping from starting state to hillclimbing outcome. As it turns out,
the Boggle domain illustrates the converse: when the results of hillclimbing from a
variety of starting states show no discernible trend, then STAGE will fail.

The  following  experiment  makes  this  clear.  I  ran  50  restarts  of  hillclimbing  for 
each  of  six  different  restarting  policies: 

random:  Reassign  all  25  tiles  in  the  grid  randomly  on  each  restart. 

EEE: Start with each tile in the grid set to the letter ‘E’.

SSS: Start with each tile in the grid set to the letter ‘S’.

ZZZ: Start with each tile in the grid set to the letter ‘Z’.

ABC:  Start  with  the  grid  set  to  ABCDE/FGHIJ/KLMNO/PQRST/UVWXY. 

cvcvc:  Assign  the  first,  third,  and  fifth  rows  of  the  grid  to  random  consonants,  and 
the  second  and  fourth  rows  of  the  grid  to  random  vowels.  High-scoring  grids 
often  have  a  pattern  similar  to  this  (or  rotations  of  this). 

The boxplot in Figure 5.6 compares the performance of hillclimbing from these
sets of states. Apparently, the restarting policy is irrelevant to hillclimbing’s mean
performance: on average, hillclimbing leads to a Boggle score near 7000 no matter
which of the above types of starting states is chosen. Thus, STAGE cannot learn
useful predictions of $V^\pi$, and its failure to outperform multi-restart hillclimbing is
consistent with our understanding of how STAGE works.

5.2  Empirical  Studies  of  Parameter  Choices 

We have shown that STAGE learns to predict search performance as a function of
starting state features, and exploits these predictions to improve optimization results.
However, STAGE’s performance clearly depends on the user’s choice of features F(x),
the function approximator used to model $V^\pi$, the policy $\pi$ being learned, and other
parameters. Section 3.4 analyzed these choices theoretically; this section continues
the analysis from an empirical perspective.

Figure 5.6. Boggle: average performance of 50 restarts of hillclimbing from six
different sets of starting states


5.2.1  Feature  Sets 

Practical domains are generally abundant in potentially useful state features. But
which of these should be given to STAGE for its learning? Providing many detailed
state features, perhaps even a complete description of the state, gives STAGE the
opportunity to model $V^\pi$ very accurately; however, the extra features require more
parameters to be fit, which increases the complexity of the fitter’s task and may slow
STAGE down intolerably. Conversely, using only a few coarse features results in
efficient fitting, but limits the prediction accuracy that STAGE can achieve.

I compared STAGE’s performance with a wide variety of feature sets on two
domains: channel routing (instance YK4) and cartogram design (instance US49).
These results are consistent with my informal experience with STAGE on many other
domains: smaller feature sets learn faster and work better.

On the channel routing benchmark, I compared seven sets of features for representing
any given state x. The other parameters to STAGE are detailed in Table 5.3.
The results of 50 runs with each feature set are summarized in Figure 5.7 and
Table 5.4. The table also gives the average final RMS error for $\tilde{V}^\pi$ on each experiment,
but note that these numbers are not directly comparable, since the training set
distributions can be markedly different from experiment to experiment. The feature sets
tried were as follows:

WPU: (w(x), p(x), U(x)), the original features used in Section 4.3 and analyzed
above.

PU: (p(x), U(x)) only, dropping the objective function Obj(x) = w(x) from the
input space. This performs badly, for the same reasons as Experiment C1 of
Section 5.1.2 above. From this and other (unreported) experiments, I conclude
that Obj(x) should generally be included as a feature.

PV: (p(x), V(x)), where V(x) is the variance feature suggested by Experiment C2
of Section 5.1.2 above. This performs well and, as can be seen in Figure 5.7,
reaches good solutions very quickly! This shows that knowing what features are
good for prediction and explicitly providing them leads to great performance.
But it was through STAGE’s success with the original features that this more
compact feature set was discovered.

rrr: (r_1(x), r_2(x), r_3(x)), where r_i(x) is a pseudorandom number in [0,1] chosen
deterministically as a function of x and i. That is, these are three features of
the state which are completely irrelevant for predicting $V^\pi$. As expected, this
performs poorly, similar to the random-walk experiments of Section 5.1.1 above.

WPUrrr: (w(x), p(x), U(x), r_1(x), r_2(x), r_3(x)). This experiment tests whether
performance degrades in the presence of irrelevant features. The result is encouraging:
there is only a very slight performance degradation relative to WPU.
Early on in each run, $\tilde{V}^\pi$ learns to assign near-zero coefficients to the irrelevant
features.

WPU2: (w(x), w(x)^2, p(x), p(x)^2, U(x)). This includes the two quadratic terms that
were used in the hand-tuned evaluation function for simulated annealing of
[Wong et al. 88]. Apparently, these two terms are not useful to STAGE, as they
are assigned relatively small coefficients and do not improve performance. (An
outlier run in this series is responsible for the increased standard error of the
mean.)

WPU2++: (w(x), w(x)^2, p(x), p(x)^2, U(x), f_{t(1)}(x), f_{t(2)}(x), ..., f_{t(144)}(x)). Here, t(i)
denotes the current track (1 ... w(x)) on which subnet i has been placed, and
f_k(x) = 1 - u_k(x) denotes the fullness of track k. These 149 features provide
much detail of state x, though not enough to reproduce x completely. The
result: learning occurs slowly, both in terms of running time per iteration (since
a 150 × 150 matrix must be inverted after each hillclimbing trajectory) and in
terms of number of iterations to reach good performance. However, when these
runs are allowed to extend ten times longer, to TotEvals = 5·10^6, performance
does reach the same excellent level as the other good STAGE runs.
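
One way to obtain irrelevant features that are nonetheless a deterministic function of the state, as in the rrr and WPUrrr experiments, is to hash the state together with the feature index. The sketch below is an assumption about how such features could be built, not the construction used in these experiments: the use of SHA-256, the function names, and the tuple encoding of a state are all hypothetical, and the values fall in [0, 1) rather than the closed interval.

```python
import hashlib


def pseudo_random_feature(state, i):
    """Reproducible value in [0, 1) determined entirely by (state, i)."""
    key = f"{i}:{state!r}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep the top 53 bits so the quotient is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def rrr_features(state):
    """Three irrelevant features r_1(x), r_2(x), r_3(x) for a state x."""
    return tuple(pseudo_random_feature(state, i) for i in (1, 2, 3))


example_state = (3, 1, 4, 1, 5)  # hypothetical encoding of a state
features_first_call = rrr_features(example_state)
features_second_call = rrr_features(example_state)
print(features_first_call == features_second_call)
```

Revisiting a state always reproduces the same three values, so the regression sees consistent inputs that carry no information about search outcomes.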


Parameter   Setting
$\pi$       stochastic hillclimbing, rejecting equi-cost moves, patience=250
ObjBound    $-\infty$
features    between 2 and 149, depending on experiment
fitter      linear regression
Pat         250
TotEvals    500,000

Table  5.3.  Summary  of  STAGE  parameters  for  channel  routing  feature  comparisons 


Figure  5.7.  Performance  of  different  feature  sets  on  channel  routing  instance  YK4 


Instance  Algorithm  Performance (50 runs each)      RMS
                     mean          best    worst     of fit
YK4       WPU        12.34±0.15    11      13        2.7
          PU         39.54±0.16    38      40        2.8
          PV         12.08±0.10    11      13        2.8
          rrr        32.86±0.56    29      36        3.7
          WPUrrr     12.80±0.22    12      16        2.7
          WPU2       12.92±1.11    11      40        2.7
          WPU2++     21.44±1.20    12      30        1.7

Table 5.4. Results with different feature sets on channel routing instance YK4

The second experiment with feature sets is from the U.S. cartogram domain.
Recall from Section 4.6 that the objective is to deform the map’s states so that
their areas meet new prescribed targets (in this instance, proportional to electoral
vote), but their shapes and connectivity remain similar to the un-deformed map.
The objective function was defined as

\[ \mathrm{Obj}(x) = \Delta\mathrm{Area}(x) + \Delta\mathrm{Gape}(x) + \Delta\mathrm{Orient}(x) + \Delta\mathrm{Segfrac}(x) \]

and STAGE used those four subcomponents of the objective function as its features for
learning. Many other features can be imagined for this domain; I examine STAGE’s
performance with several of them here. These experiments used linear regression to
fit $\tilde{V}^\pi$. Other STAGE parameters are detailed in Table 5.5, and the results are shown
in Figure 5.8 and Table 5.6. The feature sets compared are as follows:

Asp.1: (Asp(x)), a single feature describing the aspect ratio of the current map’s
bounding rectangle. Note that most moves leave this feature unchanged. STAGE
learns a coefficient near zero for this feature, and its performance of Obj $\approx$ 0.16
is about the same as that of multi-restart hillclimbing.

Peri.1: (Peri(x)), a single feature which sums the perimeter over all the states.
STAGE works surprisingly well with this feature. It learns a positive coefficient
on Peri(x); that is, it learns that hillclimbing performs better when it
begins from a map which is “scrunched.”

Cop.2: the horizontal and vertical coordinates of map x’s center of population. I
thought these would be a good pair of features for the US49 domain, since a
good cartogram should have its population center near the geometric center of
the map. However, these features were ineffective, under both linear regression
(shown in graphs) and quadratic regression (not shown).

Obj.4: ($\Delta$Area(x), $\Delta$Gape(x), $\Delta$Orient(x), $\Delta$Segfrac(x)), the four subcomponents of the
objective function. These features worked quite well, though not as well as Peri.1
on average. Note that the better STAGE results of Section 4.6 used quadratic
regression over these four features.

ObjPeri.5: Given the good performance of Peri.1 and Obj.4, it is natural to combine
them into a 5-feature representation for STAGE. This feature set performed
comparably to experiment Obj.4.

St.15: This larger feature set concentrates on five “important” states of the map:
New York, Texas, Florida, California, and Illinois. For each of these it provides
three features: the horizontal and vertical coordinates of the state’s current
centroid (expressed relative to the map’s current bounding rectangle) and the
state’s area. Note that this is the first feature set I have tailored specifically
to this instance. However, STAGE’s performance with these features, while
better than multi-restart hillclimbing, is worse than with most of the simple
instance-independent feature sets discussed above.

St.147: This experiment includes the three centroid and area features for all 49 of
the map’s states. With this many features, STAGE is significantly slowed down.
For a fixed number of moves, however, solution quality is significantly better
than multi-restart hillclimbing.


This  series  of  experiments  shows  that  for  almost  any  natural  choice  of  domain  features, 
STAGE  is  able  to  discover  a  structure  over  those  features  which  can  be  exploited  to 
improve  performance  significantly  over  multi-restart  hillclimbing. 


Parameter   Setting
$\pi$       stochastic hillclimbing, rejecting equi-cost moves, patience=200
ObjBound    0
features    between 1 and 147, depending on experiment
fitter      linear regression
Pat         200
TotEvals    1,000,000

Table  5.5.  Summary  of  STAGE  parameters  for  cartogram  feature  comparisons. 


Figure 5.8. Cartogram performance with different feature sets (axis label: map error function)

Instance  Algorithm   Performance (50 runs each)
                      mean           best     worst
US49      Asp.1       0.158±0.004    0.120    0.185
          Peri.1      0.067±0.004    0.052    0.144
          Cop.2       0.187±0.008    0.096    0.246
          Obj.4       0.092±0.011    0.039    0.170
          ObjPeri.5   0.085±0.008    0.045    0.210
          St.15       0.132±0.007    0.091    0.206
          St.147      0.109±0.009    0.064    0.205

Table 5.6. Cartogram results with different feature sets

From the experiments reported here on the channel routing and cartogram domains,
and from other informal experiments not shown, I conclude that STAGE works
best with small, simple sets of features. Small feature sets not only are speediest for
STAGE to work with, but can also, in cases such as cartogram Experiment Peri.1,
provide a bias toward good extrapolation by $\tilde{V}^\pi$, enabling successful exploration.
Features associated with subcomponents of the objective function, or the variance of such
subcomponents, were often helpful. With the function approximators tested here,
STAGE is not overly sensitive to the inclusion of extra random features; however, it
can sometimes be drawn into abysmal performance even worse than multi-restart
hillclimbing by a particularly bad set of features, as in channel routing Experiment PU.

5.2.2  Fitters 

Another key parameter to STAGE is the function approximator or “fitter” used to
model $V^\pi(x)$. In Section 3.4.3, I argued that fitters having a linear architecture,
such as polynomial regression (of any degree), are best suited to STAGE. Here, I
empirically investigate the performance of STAGE with a range of polynomial fitters,
ranging from degree 1 to degree 5, on both the channel routing and cartogram
domains.

The table below lists the eight fitters I tested. Let (f_1, f_2, ..., f_D) denote the
features of each domain given to STAGE; note that D = 3 in the channel routing
domain and D = 4 in the cartogram domain. For each fitter, the table also gives the
total number of coefficients being fit, both as a function of arbitrary D and in the
case where D = 4.


Key  Description                               # params (general D; D = 4)
F1   linear regression                         D + 1 = 5
F2   quadratic regression                      \binom{D+2}{2} = 15
F3   cubic regression                          \binom{D+3}{3} = 35
F4   quartic regression                        \binom{D+4}{4} = 70
E2   quadratic regression without cross-terms  2D + 1 = 9
E3   cubic regression without cross-terms      3D + 1 = 13
E4   quartic regression without cross-terms    4D + 1 = 17
E5   quintic regression without cross-terms    5D + 1 = 21

Note on E2: for each domain feature f_i, this model includes the terms f_i and f_i^2
but no cross-terms involving the product of more than one feature.
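
The coefficient counts can be reproduced by enumerating the terms of each model directly. The sketch below is illustrative only: the function names are hypothetical, and each term is represented simply as a tuple of feature indices (the empty tuple being the constant term). For D = 4 it gives 5, 15, 35 and 70 coefficients for the full polynomials of degree 1 to 4, and kD + 1 (so 9, 13, 17 and 21) for the cross-term-less models of degree k = 2 to 5.

```python
from itertools import combinations_with_replacement
from math import comb


def full_polynomial_terms(num_features, degree):
    """All monomials of total degree 0..degree, each as a tuple of feature indices."""
    terms = []
    for d in range(degree + 1):
        terms.extend(combinations_with_replacement(range(num_features), d))
    return terms


def crosstermless_terms(num_features, degree):
    """Constant term plus f_i**k for each feature i and each power k = 1..degree."""
    terms = [()]
    for i in range(num_features):
        for k in range(1, degree + 1):
            terms.append((i,) * k)
    return terms


D = 4
full_counts = {k: len(full_polynomial_terms(D, k)) for k in (1, 2, 3, 4)}
diag_counts = {k: len(crosstermless_terms(D, k)) for k in (2, 3, 4, 5)}
print(full_counts, diag_counts)
```

The enumeration agrees with the closed forms: a full polynomial of degree k over D features has comb(D + k, k) coefficients, while dropping the cross-terms reduces the count to kD + 1.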

Results are given in the usual form in Tables 5.7 and 5.8 and Figures 5.9 and 5.10.
On both domains, the results indicate that the choice of fitter has a relatively minor
impact on performance. On the channel routing domain, slightly better results and
fewer poor outlier runs were obtained from the most highly biased models, linear
regression and crossterm-less quadratic regression. This is consistent with our earlier
observation that the linear approximation $\tilde{V}^\pi(x) \approx 10w(x) + 0.5p(x) - 10U(x) - 80$
suffices for excellent performance here. On the cartogram domain, however, somewhat
more complex models performed slightly better.

In these experiments, STAGE’s computational overhead for function approximation
was not a significant component of running time, even for the most complex
model here, quartic regression with all cross-terms. However, since the expense of
least-squares quartic regression increases as O(D^12), this overhead would certainly
become  prohibitive  for  more  than  D  =  4  features.  In  my  experience,  quadratic  regres¬ 
sion  (either  with  or  without  cross  terms)  provides  sufficient  model  flexibility,  enough 
bias  to  enable  aggressive  extrapolation,  and  efficient  computational  performance. 


Instance  Algorithm                       Performance (50 runs each)
                                          mean          best    worst
YK4       F1 (linear)                     12.26±0.17    11      14
          F2 (quadratic)                  14.30±1.16    12      37
          F3 (cubic)                      13.08±0.35    12      19
          F4 (quartic)                    12.86±0.42    11      19
          E2 (crossterm-less quadratic)   12.30±0.14    11      13
          E3 (crossterm-less cubic)       12.80±1.07    12      39
          E4 (crossterm-less quartic)     13.54±1.15    11      39
          E5 (crossterm-less quintic)     15.02±0.71    12      24

Table 5.7. Channel routing results with different polynomial function approximators
over the 3 features (w, p, U)


Instance  Algorithm                       Performance (50 runs each)
                                          mean           best     worst
US49      (HC)                            0.174±0.002    0.152    0.195
          F1 (linear)                     0.092±0.011    0.039    0.170
          F2 (quadratic)                  0.057±0.004    0.038    0.103
          F3 (cubic)                      0.057±0.004    0.038    0.126
          F4 (quartic)                    0.069±0.003    0.045    0.096
          E2 (crossterm-less quadratic)   0.078±0.010    0.041    0.174
          E3 (crossterm-less cubic)       0.067±0.005    0.038    0.135
          E4 (crossterm-less quartic)     0.058±0.004    0.040    0.111
          E5 (crossterm-less quintic)     0.069±0.005    0.041    0.106

Table 5.8. Cartogram performance with different polynomial function approximators
over the 4 features ($\Delta$Area, $\Delta$Gape, $\Delta$Orient, $\Delta$Segfrac). All the STAGE runs
significantly outperform multi-restart hillclimbing (HC).
Figure 5.9. Performance of different function approximators on channel routing
instance YK4

Figure 5.10. Cartogram performance with different polynomial function approximators

5.2.3 Policy $\pi$

Thus far, I have discussed how the user’s choice of features and fitters affects STAGE’s
ability to learn and exploit $\tilde{V}^\pi(F(x))$. But what about the choice of $\pi$ itself? In
every domain of Chapter 4 besides Boolean satisfiability, $\pi$ was chosen to be a very
simple search procedure: stochastic first-improvement hillclimbing, rejecting equi-cost
moves. But clearly, stronger local search procedures are available. For example,
stochastic hillclimbing accepting equi-cost moves often performs better, and simulated
annealing performs better still. Can STAGE learn to improve upon these, and achieve
yet better performance?

Theoretically, even if a procedure $\pi_1$ is significantly better than another procedure
$\pi_2$, it is by no means guaranteed that STAGE on $\pi_1$ will outperform STAGE on
$\pi_2$. For example, $\pi_2$ may produce more diverse outcomes as a function of starting
state, enabling useful extrapolation in $\tilde{V}^{\pi_2}$, whereas if $\pi_1$ is expected to reach the
same solution quality no matter what state it starts from (as was the case with the
Boggle example of Section 5.1.4), then $\tilde{V}^{\pi_1}$ will be flat, and STAGE will not produce
improvement.

In this section, I consider the following three policies on the cartogram domain:

•  $\pi_1$: regular first-improvement hillclimbing, rejecting equi-cost moves, patience=200.

•  $\pi_2$: hillclimbing as above, but modified so that equi-cost and some slightly
harmful moves are accepted. Specifically, a move is rejected if and only if it
worsens Obj by more than $\delta$ = 0.0001. (This value of $\delta$ was the most effective
of a wide range tried on this domain.) On average, $\pi_2$ performs significantly
better than $\pi_1$.

•  $\pi_3$: simulated annealing with a shortened schedule length of 20,000 moves. The
short schedule allows it to be used in the context of a multi-restart procedure.
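
The difference between the two hillclimbing policies can be sketched as a single acceptance test. The code below is a simplified illustration under stated assumptions, not the implementation used in these experiments: the function and parameter names are hypothetical, the toy objective and move proposer are invented for the example, and "patience" is interpreted here as the number of consecutive considered moves that fail to improve on the best objective value seen so far. A `max_moves` cap is added purely as a safeguard.

```python
import random


def hillclimb(obj, start, propose, rng, patience=200, delta=None, max_moves=100_000):
    """Stochastic first-improvement hillclimbing; returns the list of visited states.

    delta=None  -> accept only strictly improving moves (equi-cost moves rejected).
    delta=value -> reject a move if and only if it worsens obj by more than delta.
    """
    x, fx = start, obj(start)
    best = fx
    trajectory = [x]
    since_improvement = 0
    moves = 0
    while since_improvement < patience and moves < max_moves:
        moves += 1
        y = propose(x, rng)
        fy = obj(y)
        accept = fy < fx if delta is None else fy - fx <= delta
        if accept:
            x, fx = y, fy
            trajectory.append(x)
        if fx < best:
            best = fx
            since_improvement = 0
        else:
            since_improvement += 1
    return trajectory


def toy_obj(x):
    """Hypothetical one-dimensional objective with its minimum at x = 7."""
    return (x - 7) ** 2


def propose_step(x, rng):
    """Hypothetical move proposer: step one unit left or right at random."""
    return x + rng.choice((-1, 1))


strict_traj = hillclimb(toy_obj, 0, propose_step, random.Random(0))
loose_traj = hillclimb(toy_obj, 0, propose_step, random.Random(0), delta=0.5)
```

With `delta=None` every accepted move strictly improves the objective, so each state on the trajectory is also the best so far; with a positive threshold the trajectory may drift across plateaus or slightly uphill, which is what makes the resulting policy non-Markovian in the sense discussed in this section.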

For each of these three policies, I sought to compare the default restarting method to
a “smart” restarting method learned by STAGE. In the cases of $\pi_1$ and $\pi_2$, the default
restarting method is to reset to the domain’s initial state (the original undeformed
U.S. map). In the case of $\pi_3$, I found that a better default restarting method is to
start each new annealing schedule in the same state where the previous trajectory
finished. These procedures are labelled PI1, PI2 and PI3 in the results presented
below.

In applying STAGE to $\pi_2$ and $\pi_3$, a complication arises: neither of these procedures
is Markovian. Methods for coping with this theoretical difficulty were discussed
in detail at the end of Section 3.4.1 (p. 61), and in the context of WALKSAT in
Section 4.7.2. In the case of $\pi_2$, a simple fix is to train on only those states which
are the best-so-far on their trajectory. (Note that on the pure hillclimbing trajectories
generated by $\pi_1$, every state is the best-so-far.) These procedures are labelled
STAGE(PI1) and STAGE(PI2) in the results. In the case of $\pi_3$, even training only
on best-so-far states does not make $V^{\pi_3}$ theoretically well-defined (though results
for STAGE(PI3) are given below nonetheless). Instead, giving up on the Markov
property, I resort to training on just the single starting state of each simulated
annealing trajectory. This variant of STAGE is well-defined for any proper policy $\pi$;
its results for all three of our policies are shown as STAGE0(PI1), STAGE0(PI2), and
STAGE0(PI3).
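
The three ways of choosing training states can be sketched as follows. This is an illustration under stated assumptions rather than the code used in these experiments: the function names and the `mode` strings are hypothetical, "best-so-far" is read as strictly better than every earlier state on the trajectory, and the training target for each selected state is taken to be the best objective value reached on that trajectory.

```python
def best_so_far_states(trajectory, obj):
    """States that strictly improve on every earlier state of the trajectory."""
    kept, best = [], float("inf")
    for x in trajectory:
        fx = obj(x)
        if fx < best:
            kept.append(x)
            best = fx
    return kept


def training_pairs(trajectory, obj, mode):
    """Return (state, target) pairs for one trajectory.

    mode = "all"          -> every visited state (suitable for pure hillclimbing)
    mode = "best_so_far"  -> only states that were the best seen so far
    mode = "start_only"   -> only the starting state (the STAGE0 variant)
    """
    outcome = min(obj(x) for x in trajectory)
    if mode == "all":
        states = list(trajectory)
    elif mode == "best_so_far":
        states = best_so_far_states(trajectory, obj)
    elif mode == "start_only":
        states = list(trajectory[:1])
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [(x, outcome) for x in states]


# Hypothetical trajectory in which each state is identified with its Obj value.
example_trajectory = [5, 3, 4, 2, 2, 6]
pairs_all = training_pairs(example_trajectory, lambda x: x, "all")
pairs_best = training_pairs(example_trajectory, lambda x: x, "best_so_far")
pairs_start = training_pairs(example_trajectory, lambda x: x, "start_only")
```

Training on the start state alone yields one example per trajectory, whereas the other two modes yield several, which is consistent with the slower learning reported for the STAGE0 variants.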

The results, displayed in Table 5.9 and Figures 5.11-5.13, point to several conclusions.
First, as demonstrated by Experiment STAGE(PI2), STAGE can successfully
learn from a non-Markovian policy by training only on best-so-far states. Second,
from Experiment STAGE(PI3) we can conclude that STAGE may not be effective
at learning evaluation functions from simulated annealing; this is unsurprising, since
SA’s initial period of random search makes the outcome quite unpredictable from
the starting state. Third, from the set of STAGE0 experiments, it is apparent that
STAGE learns much more quickly when it is able to train on entire trajectories, rather
than just starting states. Thus, STAGE is able to exploit the Markov property of
$\pi$ for efficient performance.


Instance  Algorithm                            Performance (50 runs each)
                                               mean           best     worst
US49      PI1 = hillclimbing                   0.174±0.002    0.152    0.192
          STAGE(PI1)                           0.057±0.004    0.038    0.103
          STAGE0(PI1)                          0.099±0.013    0.042    0.174
          PI2 = hillclimbing, $\delta$=0.0001  0.140±0.003    0.115    0.167
          STAGE(PI2)                           0.052±0.003    0.040    0.083
          STAGE0(PI2)                          0.077±0.009    0.045    0.136
          PI3 = simulated annealing            0.049±0.001    0.044    0.070
          STAGE(PI3)                           0.050±0.002    0.042    0.091
          STAGE0(PI3)                          0.060±0.005    0.043    0.142

Table 5.9. Cartogram performance with STAGE learning from a variety of
different choices of policy $\pi$. All algorithms were limited to considering TotEvals =
10^6 moves.


Figure 5.11. Cartogram performance of STAGE using $\pi_1$ = hillclimbing (axis label: map error function)

Figure 5.12. Cartogram performance of STAGE using $\pi_2$ = hillclimbing accepting
equi-cost and slightly harmful moves

Figure 5.13. Cartogram performance of STAGE using $\pi_3$ = simulated annealing


5.2.4  Exploration/Exploitation 

In  any  time-bounded  search  process  there  arises  a  tradeoff  between  exploitation,  i.e., 
pursuing  what  appears  to  be  the  best  path  given  the  limited  observations  made  thus 
far,  and  exploration,  i.e.,  trying  paths  about  which  little  is  yet  known.  Treating  this 
tradeoff  optimally  is  in  general  intractable,  and  various  heuristics  have  been  proposed 
in the reinforcement learning literature [Thrun 92, Moore and Atkeson 93, Dearden et
al. 98]. For its exploration, STAGE relies primarily on extrapolation by the function
approximator  to  guide  search  to  unvisited  regions,  and  secondarily  on  random  restarts 
when  search  stalls. 

However, there is another aspect of STAGE’s search strategy that impinges on
the exploration/exploitation dilemma. On each iteration, STAGE runs π to generate
a trajectory (x_0 ... x_T), uses this trajectory to update Ṽ^π, and then searches for a
good starting state of π by hillclimbing on the new Ṽ^π. This section investigates the
design decision of where to begin each search of Ṽ^π:

Continue: This is STAGE’s normal policy, namely, to begin each search of Ṽ^π by
simply continuing from the current local optimum x_T.

Re-init: Begin each search of Ṽ^π from a random initial state. This promotes greater
global exploration of the space, at the cost of losing the benefit of the work just
done by π to reach a local optimum of Obj.

Compromises between Continue and Re-init are possible. Let Re-init(K)
denote the policy of jumping to a random initial state if and only if K consecutive
STAGE iterations have failed to improve on the best optimum yet seen.
Thus, Re-init(0) always starts from a random state, and Re-init(∞) is the
same as the Continue policy. I tested a range of settings for K.

Best-ever: Begin each search of Ṽ^π from the best state seen so far on this run.
This promotes greater exploitation of a known excellent solution, at the cost of
possibly becoming over-focused on one area of search space.

Again, compromises are possible. In this case, I implemented Best-ever(p),
which on each iteration either jumps to the best-so-far state with probability
p or continues from the current state x_T with probability 1 − p. Note that
Best-ever(0), like Re-init(∞), is equivalent to Continue.
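
As an illustration only, the three families of start-state policies described above can be sketched in a few lines of Python. The function name `choose_search_start`, its arguments, and the `stall_count` bookkeeping (the number of consecutive STAGE iterations that failed to improve on the best optimum) are hypothetical names introduced here, not part of the STAGE implementation.

```python
import random

def choose_search_start(policy, x_T, best_ever, stall_count, random_state,
                        K=None, p=None, rng=random):
    """Pick the state from which the next search of the learned function begins.

    policy       -- "continue", "re-init" or "best-ever" (hypothetical labels)
    x_T          -- local optimum just reached by the baseline policy
    best_ever    -- best state seen so far on this run
    stall_count  -- consecutive iterations without improving the best optimum
    random_state -- zero-argument callable returning a random initial state
    """
    if policy == "continue":
        return x_T
    if policy == "re-init":
        # Re-init(K): jump to a random state iff K consecutive iterations stalled.
        # K = 0 always jumps; K = float("inf") never jumps (same as Continue).
        return random_state() if stall_count >= K else x_T
    if policy == "best-ever":
        # Best-ever(p): jump to the best-so-far state with probability p.
        # rng.random() lies in [0, 1), so p = 0 never jumps and p = 1 always does.
        return best_ever if rng.random() < p else x_T
    raise ValueError(f"unknown policy: {policy}")
```

Under these assumptions, `choose_search_start("re-init", x_T, best, stall_count, sampler, K=5)` keeps continuing from `x_T` until five consecutive iterations have stalled.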

I compared Continue, Re-init(0), Re-init(2), Re-init(5), Re-init(20), Best-ever(0.1),
Best-ever(0.4), Best-ever(0.7), and Best-ever(1) on cartogram domain US49.
The results are given in Table 5.10 and Figure 5.14. They show that frequent
jumping to a random state, as Re-init(K) does for small K, does negatively affect
solution quality, whereas frequent jumping to the best-so-far state, as done by
Best-ever(p), neither helps performance nor hurts it significantly. STAGE’s Continue
policy seems to strike an acceptable balance between exploration and exploitation.


Instance  Algorithm                 Performance (50 runs each)
                                    mean          best    worst
US49      Continue (normal STAGE)   0.057±0.003   0.037   0.103
          Re-init(0)                0.171±0.003   0.150   0.197
          Re-init(2)                0.072±0.004   0.045   0.112
          Re-init(5)                0.059±0.004   0.040   0.109
          Re-init(20)               0.056±0.005   0.037   0.114
          Best-ever(0.1)            0.056±0.004   0.037   0.100
          Best-ever(0.4)            0.062±0.006   0.038   0.147
          Best-ever(0.7)            0.060±0.005   0.041   0.129
          Best-ever(1.0)            0.062±0.004   0.039   0.113

Table 5.10. Cartogram performance under various modifications of STAGE’s policy
for initializing search of Ṽ^π on each round

Figure 5.14. Cartogram performance under various modifications of STAGE’s policy
for initializing search of Ṽ^π on each round


5.2.5  Patience  and  ObjBound 

Finally, I investigated the effect of the two remaining STAGE parameters: Pat and
ObjBound. Recall from the STAGE algorithm (p. 52) that these parameters jointly
determine when STAGE’s second phase of search, stochastic hillclimbing on Ṽ^π,
should terminate. ObjBound corresponds to a known lower bound on the objective
function; it cuts the search off when Ṽ^π(F(x)) predicts that x is an impossibly
good starting state, i.e., Ṽ^π(F(x)) < ObjBound. This cutoff may be disabled by
setting ObjBound = −∞. The other parameter, Pat, cuts the search off when too
many consecutive moves have failed to improve Ṽ^π. If either of these parameters is
set too aggressively (too tight a bound, too-low patience), then STAGE will fail to
reach the best starting points predicted by its function approximator. If, on the other
hand, these are set too loosely, then STAGE may waste valuable time that could instead
have been spent on the first phase of search, namely, running π and gathering
new training data. How sensitive is STAGE to the precise settings used?
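
The joint role of the two cutoffs can be made concrete with a minimal sketch. It assumes a minimization setting, a caller-supplied `v_hat` that already folds in the feature map F, and a neighbor sampler `random_neighbor`; these names, and the choice to reject equi-cost moves, are assumptions made for illustration rather than details taken from the STAGE code.

```python
import random

def search_learned_evaluation(x, v_hat, random_neighbor, pat, obj_bound, rng=None):
    """Stochastic hillclimbing on a learned evaluation function v_hat.

    Stops when `pat` consecutive moves have failed to improve v_hat, or when
    v_hat(x) drops below obj_bound (an "impossibly good" prediction).
    Passing obj_bound = float("-inf") disables the second cutoff.
    """
    rng = rng or random.Random()
    failures = 0  # consecutive moves that did not improve v_hat
    while failures < pat and v_hat(x) >= obj_bound:
        candidate = random_neighbor(x, rng)
        if v_hat(candidate) < v_hat(x):
            x, failures = candidate, 0   # accept the improving move
        else:
            failures += 1                # reject and lose some patience
    return x
```

With a very small `pat` the loop exits before reaching the minimum of `v_hat`; with `obj_bound` disabled it keeps descending even where the prediction is already below any achievable objective value.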

Experimental results of varying these parameters on the cartogram domain are
presented in Table 5.11 and Figures 5.15 and 5.16. STAGE’s other parameters were
set as in the cartogram experiments of Section 4.6 (p. 95). The results show that
STAGE’s performance suffered when Pat was very low or very high, but was good
for a wide range of settings between 64 and 1024. As for the ObjBound parameter,
STAGE performed best when it was set to its true bound of 0, but degraded gracefully
when it was set to −1 (too loose a bound) or +0.1 (too tight). Performance was,
however, significantly worse for ObjBound = −∞, demonstrating that the ObjBound
parameter is useful. By cutting off search on Ṽ^π when the function approximator is
provably inaccurate, ObjBound saves computation time and prevents search from
wandering too far astray.


Instance  Algorithm       Performance (50 runs each)
                          mean          best    worst
US49      Pat=16          0.161±0.015   0.079   0.285
          Pat=32          0.095±0.008   0.040   0.147
          Pat=64          0.065±0.006   0.040   0.151
          Pat=128         0.057±0.004   0.039   0.114
          Pat=256         0.059±0.007   0.038   0.190
          Pat=512         0.062±0.006   0.041   0.134
          Pat=1024        0.070±0.005   0.044   0.126
          Pat=2048        0.100±0.006   0.046   0.159
          Pat=4096        0.126±0.008   0.074   0.187
          Pat=8192        0.122±0.007   0.057   0.172
          Pat=16384       0.132±0.006   0.068   0.178
          Pat=32768       0.129±0.007   0.079   0.189
          Pat=65536       0.135±0.007   0.081   0.184
          ObjBound=−∞     0.083±0.011   0.036   0.189
          ObjBound=−1     0.060±0.006   0.037   0.143
          ObjBound=0      0.057±0.004   0.038   0.103
          ObjBound=0.1    0.061±0.006   0.040   0.171

Table 5.11. Cartogram performance with varying settings of Pat and ObjBound


5.3  Discussion 

This chapter gives depth to the results of the previous chapter, demonstrating that
STAGE does indeed obtain its success by exploiting the power of reinforcement learning,
and that it works reliably over a wide range of parameter settings. The chapter’s
empirical conclusions may be summarized as follows:

• Evaluation functions other than STAGE’s learned Ṽ^π, built at random or by
simply smoothing the objective function, perform much worse than Ṽ^π in the
context of multi-start optimization. STAGE successfully learns and exploits
the predictive power of value functions.

• When the baseline policy π does not demonstrate predictable trends as a function
of state features, then Ṽ^π will not be able to guide STAGE to promising
new starting states, so STAGE will not improve (nor hurt) performance compared
to multi-start π. This was demonstrated for hillclimbing in the Boggle
domain and for simulated annealing in the cartogram domain.

• For the policy of π = stochastic hillclimbing, STAGE can empirically learn a useful
global structure from many different natural choices of feature sets. STAGE
is robust in the presence of irrelevant features, though it works most efficiently
and effectively with small feature sets. As a choice of function approximator,
quadratic regression provides sufficient model flexibility, enough bias to enable
aggressive extrapolation, and efficient computational performance.

• Empirically, STAGE makes reasonable tradeoffs between exploration and exploitation,
and it performs robustly over a wide range of settings for the parameters
Pat and ObjBound.

Figure 5.15. Cartogram performance with various patience settings

Figure 5.16. Cartogram performance with different ObjBound levels

Chapter 6

STAGE:  Extensions 


The preceding chapters have demonstrated the effectiveness of a simple idea:
learned predictions of search outcomes can be used to improve future search outcomes.
In this chapter, I consider two independent extensions to the basic STAGE
algorithm:

• Section 6.1 investigates the use of the TD(λ) family of algorithms, including
a new least-squares formulation, for making more efficient use of memory and
data while learning Ṽ^π.

• Section 6.2 illustrates how STAGE’s training time on a problem instance may
be reduced by transferring learned Ṽ^π functions from previously solved, similar
problem instances.

6.1 Using TD(λ) to learn V^π

In almost all of the experimental results of Chapters 4 and 5, STAGE learned to
approximate V^π for a particularly simple choice of π: stochastic hillclimbing with
equi-cost moves rejected. This procedure is proper, Markovian, and monotonic, and
therefore (as shown earlier in Section 3.4.1) induces a Markov chain over the configuration
space X. V^π(x) is precisely the value function of the chain.

To approximate V^π, STAGE uses simple Monte-Carlo simulation and linear least-squares
function approximation, as described in [Bertsekas and Tsitsiklis 96, §6.2.1].
However, a reinforcement learning technique, TD(λ) or temporal difference learning,
also applies. TD(λ) is a family of incremental gradient-descent algorithms for approximating
V^π, parametrized by λ ∈ [0, 1] [Sutton 88]. For the case of λ = 0, Bradtke
and Barto [96] demonstrated improved data efficiency with a least-squares formulation
of the algorithm, which they called LSTD(0). LSTD(0) also eliminates all stepsize
parameters from the TD procedure.

In this section, I generalize Bradtke and Barto’s results to arbitrary values of
λ ∈ [0, 1], drawing on the analyses of TD(λ) in [Tsitsiklis and Roy 96, Bertsekas and
Tsitsiklis 96]. I show that LSTD(1) produces the same coefficients as Monte-Carlo
simulation with linear regression, but requires less memory (and no extra computation).
I also explain how LSTD(λ) bridges the gap between model-free and model-based
RL algorithms. Finally, I demonstrate empirical results with STAGE in the
Bayes net structure-finding domain, showing that LSTD(λ) with λ set to values less
than 1 can sometimes learn an effective approximation of V^π from less simulation
data than supervised linear regression.

6.1.1 TD(λ): Background

TD(λ) addresses the problem of computing the value function V^π of a Markov chain,
or equivalently, of a fixed policy π in a Markov Decision Problem. This is an important
subproblem of several algorithms for sequential decision making, including policy
iteration [Bertsekas and Tsitsiklis 96] and STAGE. V^π(x) simply predicts the expected
long-term sum of future rewards obtained when starting from state x and following
policy π until termination. This function is well-defined as long as π is proper, i.e.,
guaranteed to terminate.*

For small Markov chains whose transition probabilities are all explicitly known,
computing V^π is a trivial matter of solving a system of linear equations; TD(λ) is not
needed. However, in many practical applications, the transition probabilities of the
chain are available only implicitly: either in the form of a simulation model or in the
form of an agent’s actual experience executing π in its environment. In either case, we
must compute V^π or an approximation to V^π solely from a collection of trajectories
sampled from the chain. This is where the TD(λ) family of algorithms applies.

* For improper policies, V^π may be made well-defined by the use of a discount factor that
exponentially reduces future rewards; the TD(λ) and LSTD(λ) algorithms both extend straightforwardly
to that case. However, for simplicity I will assume here that V^π is undiscounted and that π is proper.

TD(λ) was introduced in [Sutton 88]; excellent summaries may now be found in
several books [Bertsekas and Tsitsiklis 96, Sutton and Barto 98]. For each state on an
observed trajectory (x_0, x_1, ..., x_L, END), TD(λ) incrementally adjusts the coefficients
of Ṽ^π to more closely satisfy

\[
\tilde V^\pi(x_t) = (1-\lambda) \sum_{k=t}^{L-1} \lambda^{k-t} \Bigl( \tilde V^\pi(x_{k+1}) + \sum_{j=t}^{k} R_j \Bigr) + \lambda^{L-t} \sum_{j=t}^{L} R_j \tag{6.1}
\]

where R_j is a shorthand for the one-step reward R(x_j, x_{j+1}). The right hand side of
Equation 6.1 can be interpreted as computing the weighted average of L − t different
lookahead-based estimates for Ṽ^π(x_t). The different estimates are the 1-step truncated
return (R_t + Ṽ^π(x_{t+1})), 2-step truncated return (R_t + R_{t+1} + Ṽ^π(x_{t+2})), and so
forth up to the total Monte-Carlo return (R_t + R_{t+1} + ··· + R_L). The relative weights
of these estimates are determined by the terms involving λ, which sum to unity. The
λ parameter smoothly interpolates between two extremes:

• TD(1): adjust Ṽ^π based only on the Monte-Carlo return. This gives rise to an
incremental form of supervised learning.

• TD(0): adjust Ṽ^π based only on the 1-step lookahead R_t + Ṽ^π(x_{t+1}). This
gives rise to an incremental, sampled form of Value Iteration.

The target values assigned by TD(1) are unbiased samples of V^π, but may have
significant variance since each depends on a long sequence of rewards from stochastic
transitions. By contrast, TD(0)’s target values have low variance (the only random
component is the reward of a single state transition) but are biased by the potential
inaccuracy of the current estimate of V^π. The parameter λ trades off between bias
and variance. Empirically, intermediate values of λ seem to perform best [Sutton
88, Singh and Sutton 96, Sutton 96].
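
To make the weighting in Equation 6.1 concrete, the following sketch computes the TD(λ) target for the first state of a trajectory suffix. It assumes the reading of Equation 6.1 in which the truncated returns receive weights (1 − λ)λ^k and the final Monte-Carlo return receives the leftover weight; the function name `lambda_return` and its argument layout are hypothetical.

```python
def lambda_return(rewards, next_values, lam):
    """Weighted average of lookahead estimates for the first state.

    rewards[k]     -- one-step reward of the k-th transition (the last enters END)
    next_values[k] -- current value estimate of the state reached by transition k;
                      the final entry corresponds to END and should be 0
    Returns (target, total_weight); total_weight should equal 1.
    """
    n = len(rewards)
    target = 0.0
    total_weight = 0.0
    partial_sum = 0.0  # running sum of rewards R_t + ... + R_{t+k}
    for k in range(n):
        partial_sum += rewards[k]
        if k < n - 1:
            weight = (1 - lam) * lam ** k   # (k+1)-step truncated return
        else:
            weight = lam ** k               # total Monte-Carlo return
        target += weight * (partial_sum + next_values[k])
        total_weight += weight
    return target, total_weight
```

Setting `lam=0` reduces the target to the 1-step lookahead and `lam=1` reduces it to the Monte-Carlo return, matching the two extremes listed above.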

A convenient form of the TD(λ) algorithm is given in Table 6.1. This version
of the algorithm assumes that the policy π is proper, that the approximation architecture
is linear (as described in Section 3.4.3), and that updates are offline (i.e., the
coefficients of Ṽ^π are modified only at the end of each trajectory). On each transition,
the algorithm computes the scalar one-step TD error R_t + (φ(x_{t+1}) − φ(x_t))ᵀβ, and
apportions that error among all state features according to their respective eligibilities
z_t. The eligibility vector may be seen as an algebraic trick by which TD(λ) propagates
rewards backward over the current trajectory without having to remember the
trajectory explicitly. Each feature’s eligibility at time t depends on the trajectory’s
history and on λ:

\[
z_t = \sum_{k=t_0}^{t} \lambda^{t-k} \phi(x_k)
\]

where t_0 is the time when the current trajectory started. In the case of TD(0), only
the current state’s features are eligible to be updated, so z_t = φ(x_t); whereas in
TD(1), the features of all states seen so far on the current trajectory are eligible, so

\[
z_t = \sum_{k=t_0}^{t} \phi(x_k).
\]
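
The claim that the eligibility vector avoids storing the trajectory can be checked with a small sketch. It compares the one-step recurrence used by the algorithm (z ← λz + φ(x)) with the explicit sum over the trajectory history; the helper names are hypothetical and feature vectors are plain Python lists.

```python
def eligibility_by_recurrence(features, lam):
    """Return z_0, z_1, ... computed incrementally, one feature vector at a time."""
    z = [0.0] * len(features[0])
    history = []
    for phi in features:
        # z_t = lam * z_{t-1} + phi(x_t); only the previous z is needed
        z = [lam * z_i + phi_i for z_i, phi_i in zip(z, phi)]
        history.append(z)
    return history

def eligibility_closed_form(features, lam, t):
    """Return z_t as the explicit sum of lam**(t-k) * phi(x_k) for k = 0..t."""
    dim = len(features[0])
    return [sum(lam ** (t - k) * features[k][d] for k in range(t + 1))
            for d in range(dim)]
```

The recurrence touches each feature vector once, whereas the closed form needs the whole history; both give the same z_t.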

The reason for the restriction to linear approximation architectures is that TD(λ)
provably converges when such architectures are used, under a few mild additional
assumptions detailed below [Tsitsiklis and Roy 96]. The currently available convergence
results may be summarized as follows [Bertsekas and Tsitsiklis 96]:

• If the approximator is nonlinear, TD(λ) may diverge [Bertsekas and Tsitsiklis
96], though it has been successful with nonlinear neural networks in practice.

• If the function approximator is linear and can represent the optimal V^π exactly,
then TD(λ) will converge to V^π. This result applies primarily to representations
with one independent feature per state, i.e., lookup tables.

• If the approximator is linear but cannot represent V^π exactly, then TD(1) will
converge to the best least-squares fit of V^π. For values of λ < 1, TD(λ) will
also converge, but possibly to a suboptimal fit of V^π; the error bound worsens
as λ decreases toward zero [Bertsekas and Tsitsiklis 96]. In practice, though,
smaller values of λ introduce less variance and may enable TD(λ) to converge
to an acceptable approximation of V^π using less data.

TD(λ) for approximate policy evaluation:

Given:
  • a simulation model for a proper policy π in MDP X;
  • a featurizer φ : X → ℝ^K mapping states to feature vectors, φ(END) = 0;
  • a parameter λ ∈ [0, 1]; and
  • a sequence of stepsizes α_1, α_2, ... for incremental coefficient updating.
Output: a coefficient vector β for which V^π(x) ≈ β · φ(x).

Set β := 0 (arbitrary initial estimate), t := 0.
for n := 1, 2, ... do: {
    Set δ := 0.
    Choose a start state x_t ∈ X.
    Set z_t := φ(x_t).
    while x_t ≠ END, do: {
        Simulate one step of the chain, producing a reward R_t and next state x_{t+1}.
        Set δ := δ + z_t (R_t + (φ(x_{t+1}) − φ(x_t))ᵀ β).
        Set z_{t+1} := λ z_t + φ(x_{t+1}).
        Set t := t + 1.
    }
    Set β := β + α_n δ.
}

LSTD(λ) for approximate policy evaluation:

Given: a simulation model, featurizer, and λ as above; no stepsizes.
Output: a coefficient vector β for which V^π(x) ≈ β · φ(x).

Set A := 0, b := 0, t := 0.
for n := 1, 2, ... do: {
    Choose a start state x_t ∈ X.
    Set z_t := φ(x_t).
    while x_t ≠ END, do: {
        Simulate one step of the chain, producing a reward R_t and next state x_{t+1}.
        Set A := A + z_t (φ(x_t) − φ(x_{t+1}))ᵀ.
        Set b := b + z_t R_t.
        Set z_{t+1} := λ z_t + φ(x_{t+1}).
        Set t := t + 1.
    }
    Whenever updated coefficients are desired: Set β := A^{-1} b. (Use SVD.)
}

Table 6.1. Gradient-descent and least-squares versions of trial-based TD(λ) for
approximating the undiscounted value function of a fixed proper policy. Note that A
has dimension K × K, and b, β, δ, z, and φ(x) all have dimension K × 1.
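
For readers who prefer running code, the two procedures of Table 6.1 can be sketched in Python with NumPy. This is an illustrative transcription under stated assumptions, not the thesis implementation: trajectories are supplied as pre-simulated lists of `(x, reward, x_next)` triples with `x_next = None` standing for END, the featurizer `phi` must return a zero vector for `None`, and `np.linalg.pinv` (which is SVD-based) stands in for the "Use SVD" step.

```python
import numpy as np

def td_lambda(trajectories, phi, K, lam, stepsize):
    """Offline gradient-descent TD(lambda); `stepsize` maps n to alpha_n."""
    beta = np.zeros(K)
    for n, trajectory in enumerate(trajectories, start=1):
        delta = np.zeros(K)
        z = phi(trajectory[0][0]).astype(float)        # z := phi(start state)
        for x, reward, x_next in trajectory:
            td_error = reward + (phi(x_next) - phi(x)) @ beta
            delta += z * td_error                      # apportion error by eligibility
            z = lam * z + phi(x_next)
        beta = beta + stepsize(n) * delta              # update only at trajectory end
    return beta

def lstd_lambda(trajectories, phi, K, lam):
    """Least-squares TD(lambda): accumulate A and b, then solve A beta = b."""
    A = np.zeros((K, K))
    b = np.zeros(K)
    for trajectory in trajectories:
        z = phi(trajectory[0][0]).astype(float)
        for x, reward, x_next in trajectory:
            A += np.outer(z, phi(x) - phi(x_next))
            b += z * reward
            z = lam * z + phi(x_next)
    return np.linalg.pinv(A) @ b                       # SVD-based pseudo-inverse

def one_hot(x, K=2):
    """Hypothetical lookup-table featurizer for a two-state chain; END maps to 0."""
    return np.zeros(K) if x is None else np.eye(K)[x]

# Deterministic chain 0 -> 1 -> END with rewards 1 and 2 (true values 3 and 2).
chain = [(0, 1.0, 1), (1, 2.0, None)]
beta_lstd = lstd_lambda([chain], one_hot, K=2, lam=0.5)
beta_td = td_lambda([chain] * 2000, one_hot, K=2, lam=0.5, stepsize=lambda n: 0.1)
```

On this toy chain a single trajectory suffices for `lstd_lambda`, while `td_lambda` with a constant stepsize needs many repeated passes to approach the same coefficients, which illustrates the data-efficiency point made in the text.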

Sufficient conditions for the convergence, with probability 1, of TD(λ) are as follows:

1. The stepsizes α_n are nonnegative, satisfy Σ_{n=1}^{∞} α_n = ∞, and satisfy
Σ_{n=1}^{∞} α_n² < ∞. (The stepsizes α_n = c/n satisfy this condition, though in
practice a small constant stepsize is often used.)

2. All states x ∈ X have positive probability of being visited given the start state
distribution chosen. (This is a technicality: unvisited states may simply be
deleted from the chain.)

3. The features are linearly independent of one another over X. That is, the
feature set is not redundant.

What does TD(λ) converge to? Examining the update rule for δ in Table 6.1,
it is not difficult to see that the coefficient changes made by TD(λ) after an observed
trajectory (x_0, x_1, ..., x_L, END) have the form

\[
\beta := \beta + \alpha_n (d + C\beta + \omega)
\]

where

\[
d = E\Bigl\{ \sum_{i=0}^{L} z_i R(x_i, x_{i+1}) \Bigr\},
\qquad
C = E\Bigl\{ \sum_{i=0}^{L} z_i \bigl(\phi(x_{i+1}) - \phi(x_i)\bigr)^{\mathsf T} \Bigr\},
\qquad
\omega = \text{zero-mean noise}.
\]

The expectations are taken with respect to the distribution of trajectories through
the Markov chain. It is shown in [Bertsekas and Tsitsiklis 96] that C is negative
definite and that the noise ω has sufficiently small variance, which, together with the
stepsize conditions given above, imply that β converges to a fixed point β_λ satisfying

\[
d + C\beta_\lambda = 0.
\]

In effect, TD(λ) solves this system of equations by performing stochastic gradient
descent on the potential function ‖β − β_λ‖². It never explicitly represents d or C.
The changes to β depend only on the most recent trajectory, and after those changes
are made, the trajectory and its rewards are simply forgotten. This approach, while
requiring little computation time per iteration, is wasteful with data and may require
sampling many trajectories to reach convergence.
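
The form of the per-trajectory update quoted above can be verified in two lines. The sketch below assumes, as the offline-update convention implies, that β is held fixed while δ is accumulated over the trajectory, and it writes R_i for R(x_i, x_{i+1}).

```latex
\begin{align*}
\delta &= \sum_{i=0}^{L} z_i \Bigl( R_i + \bigl(\phi(x_{i+1}) - \phi(x_i)\bigr)^{\mathsf T} \beta \Bigr) \\
       &= \sum_{i=0}^{L} z_i R_i
          + \Bigl[ \sum_{i=0}^{L} z_i \bigl(\phi(x_{i+1}) - \phi(x_i)\bigr)^{\mathsf T} \Bigr] \beta , \\
E\{\delta\} &= d + C\beta , \qquad
\omega = \delta - E\{\delta\} , \qquad E\{\omega\} = 0 .
\end{align*}
```

The update β := β + α_n δ is therefore a noisy step in the direction d + Cβ, which vanishes exactly at the fixed point.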

One technique for using data more efficiently is “experience replay” [Lin 93]:
explicitly remember all trajectories ever seen, and whenever asked to produce an updated
set of coefficients, perform repeated passes of TD(λ) over all the saved trajectories
until convergence. This technique is similar to the batch training methods
commonly used to train neural networks. However, in the case of linear function
approximators, there is another way.

6.1.2 The Least-Squares TD(λ) Algorithm

The Least-Squares TD(λ) algorithm, or LSTD(λ), converges to the same coefficients
β_λ that TD(λ) does. However, instead of performing gradient descent, LSTD(λ)
builds explicit estimates of the C matrix and d vector (actually, estimates of a constant
multiple of C and d), and then effectively solves d + Cβ_λ = 0 directly. The
actual data structures that LSTD(λ) builds from experience are the matrix A (of
dimension K × K, where K is the number of features) and the vector b (of dimension
K):

\[
b = \sum_{i=0}^{t} z_i R_i ,
\qquad
A = \sum_{i=0}^{t} z_i \bigl(\phi(x_i) - \phi(x_{i+1})\bigr)^{\mathsf T} .
\]

After n independent trajectories have been observed, b is an unbiased estimate of
nd, and A is an unbiased estimate of −nC. Thus, β_λ can be estimated as A^{-1}b.
As in the least-squares linear regression technique of Section 3.4.3, I use Singular
Value Decomposition to invert A robustly [Press et al. 92]. The complete LSTD(λ)
algorithm is specified in the bottom half of Table 6.1.

LSTD(λ) is a generalization of the LSTD(0) algorithm [Bradtke and Barto 96] to
the case of arbitrary λ. When λ = 0, the equations reduce to

\[
b = \sum_{i=0}^{t} \phi(x_i) R(x_i, x_{i+1}) ,
\qquad
A = \sum_{i=0}^{t} \phi(x_i) \bigl(\phi(x_i) - \phi(x_{i+1})\bigr)^{\mathsf T} ,
\tag{6.2}
\]

the same as those derived by Bradtke and Barto using a different approach based on
regression with instrumental variables [Bradtke and Barto 96].

At the other extreme, when λ = 1, LSTD(1) produces precisely the same A
and b as would be produced by supervised linear regression on training pairs of
(state features ↦ observed eventual outcomes), as described for STAGE in Chapter 3
(Equation 3.8, page 67). The proof of this equivalence is given in Appendix A.2.
Thanks to the algebraic trick of the eligibility vectors, LSTD(1) builds the regression
matrices fully incrementally, without having to store the trajectory while waiting to
observe the eventual outcome. When trajectories through the chain are long, this
provides significant memory savings over linear regression.
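
The equivalence can also be checked numerically on a single trajectory. The sketch below assumes that the regression matrices of Equation 3.8 are the usual normal-equation pair ΦᵀΦ and Φᵀy, with y the Monte-Carlo return observed from each state; the function names are hypothetical and NumPy is used for the linear algebra.

```python
import numpy as np

def lstd1_matrices(Phi, rewards):
    """Accumulate A and b incrementally with lambda = 1 (no stored trajectory)."""
    T, K = Phi.shape
    A = np.zeros((K, K))
    b = np.zeros(K)
    z = np.zeros(K)
    for t in range(T):
        z = z + Phi[t]                                   # eligibility: sum of features so far
        phi_next = Phi[t + 1] if t + 1 < T else np.zeros(K)   # phi(END) = 0
        A += np.outer(z, Phi[t] - phi_next)
        b += z * rewards[t]
    return A, b

def regression_matrices(Phi, rewards):
    """Normal equations for regressing Monte-Carlo returns on state features."""
    returns = np.cumsum(rewards[::-1])[::-1]             # return-to-go from each state
    return Phi.T @ Phi, Phi.T @ returns

rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 3))       # six visited states, three features each
rewards = rng.normal(size=6)        # one-step rewards along the trajectory
A_lstd, b_lstd = lstd1_matrices(Phi, rewards)
A_reg, b_reg = regression_matrices(Phi, rewards)
```

The incremental version needs only the current eligibility vector, whereas the regression version must wait until the end of the trajectory to know each state's return.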

The computation per timestep required to update A and b is the same as least-squares
linear regression: O(K²), where K is the number of features. LSTD(λ) must
also perform a matrix inversion at a cost of O(K³) whenever β’s coefficients are
needed; in the case of STAGE, this is once per complete trajectory. (If updated coefficients
are required more frequently, then the O(K³) cost can be avoided by recursive
least-squares [Bradtke and Barto 96] or Kalman-filtering techniques [Bertsekas and
Tsitsiklis 96, §3.2.2], which update β on each timestep at a cost of only O(K²).)
LSTD(λ) is more computationally expensive than incremental TD(λ), which updates
the coefficients using only O(K) computation per timestep. However, LSTD(λ) offers
several significant advantages, as pointed out by Bradtke and Barto in their discussion
of LSTD(0) [96]:


• Least-squares algorithms are “more efficient estimators in the statistical sense”
because “they extract more information from each additional observation.”

• TD(λ)’s convergence can be slowed dramatically by a poor choice of the stepsize
parameters α_n. LSTD(λ) eliminates these parameters.

• TD(λ)’s performance is sensitive to ‖β_λ − β_init‖, the distance between β_λ and
the initial estimate for β. LSTD(λ) requires no arbitrary initial estimate.

• TD(λ) is also sensitive to the ranges of the individual features. LSTD(λ) is not.

Section 6.1.4 below presents experimental results comparing TD(λ) with LSTD(λ) in
terms of data efficiency and time efficiency.


6.1.3 LSTD(λ) as Model-Based TD(λ)

Before giving experimental results with LSTD(λ), I would like to point out an interesting
connection between LSTD(λ) and model-based reinforcement learning. To
begin, let us restrict our attention to the case of a small discrete state space X,
over which V^π can be represented and learned exactly by a lookup table. A classical
model-based algorithm for learning V^π from simulated trajectory data would proceed
as follows:

1. From the state transitions and rewards observed so far, build in memory an
empirical model of the Markov chain. The sufficient statistics of this model
consist of, for each state x ∈ X:

   • n(x), the number of times state x was visited;

   • c(x′|x), the count of how many times x′ followed x, for each state x′ ∈ X.
   We do not need to track the absorption frequency c(END|x) separately,
   since c(END|x) = n(x) − Σ_{x′∈X} c(x′|x);

   • s(x), the sum of all observed one-step rewards from state x. (The expected
   reward at x is then given simply by s(x)/n(x).)

2. Whenever a new estimate of the value function is desired, solve the current
empirical model. Solving the model means solving the linear system of Bellman
equations (Eq. 2.3), which can be written in the above notation as, ∀x ∈ X:

\[
V^\pi(x) = \frac{s(x)}{n(x)} + \sum_{x' \in X} \frac{c(x'|x)}{n(x)} V^\pi(x')
\tag{6.3}
\]

In matrix notation, if we let N be the diagonal matrix of visitation frequencies
n(x), let C be the matrix of counts where C_ij = c(x_j|x_i), let s be the vector
of summed-rewards s(x), and let v be the vector of values, then the above
equations become simply

\[
N v = s + C v,
\]

whose solution is given by

\[
v = (N - C)^{-1} s.
\tag{6.4}
\]

This model-based technique contrasts with TD(λ), a model-free approach to the same
problem. TD(λ) does not maintain any statistics on observed transitions and rewards;
it simply updates the components of v incrementally after each observed trajectory.
In the limit, assuming a lookup-table representation, both converge to the optimal
V^π. The advantage of TD(λ) is its low computational burden per step; the advantage
of the classical model-based method is that it makes the most of the available training
data. The empirical advantages of model-based and model-free reinforcement learning
methods have been investigated in, e.g., [Sutton 90, Moore and Atkeson 93, Atkeson
and Santamaria 97, Kuvayev and Sutton 97].
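
The classical model-based procedure (steps 1 and 2 above) is short enough to sketch directly. The function name and the `(x, reward, x_next)` transition format, with `x_next = None` standing for END and states numbered from 0, are assumptions made for this illustration; the sketch also assumes every state has been visited at least once so that N − C is invertible.

```python
import numpy as np

def solve_empirical_model(transitions, num_states):
    """Build the sufficient statistics N, C, s and solve (N - C) v = s."""
    N = np.zeros((num_states, num_states))   # diagonal matrix of visit counts n(x)
    C = np.zeros((num_states, num_states))   # C[i, j] = c(x_j | x_i)
    s = np.zeros(num_states)                 # summed one-step rewards s(x)
    for x, reward, x_next in transitions:
        N[x, x] += 1
        s[x] += reward
        if x_next is not None:               # absorption into END is not counted in C
            C[x, x_next] += 1
    return np.linalg.solve(N - C, s)

# Two trajectories through the chain 0 -> 1 -> END, with different final rewards.
observed = [(0, 1.0, 1), (1, 2.0, None), (0, 1.0, 1), (1, 4.0, None)]
values = solve_empirical_model(observed, num_states=2)
```

Here state 1 has average reward 3 before absorption, and state 0 earns 1 and then always moves to state 1, so the solved values are 4 and 3.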

Where does LSTD(λ) fit in? The answer is that, when λ = 0, it precisely duplicates
the classical model-based method sketched above. When λ > 0, it does
something else sensible, which I describe below; but let us first consider LSTD(0).
The assumed lookup-table representation for Ṽ^π means that we have one independent
feature per state: the feature vector φ corresponding to state 1 is (1, 0, 0, ..., 0);
corresponding to state 2 is (0, 1, 0, ..., 0); etc. LSTD(0) performs the following operations
upon each observed transition (cf. Equation 6.2):

\[
b := b + \phi(x_t) R_t
\]
\[
A := A + \phi(x_t) \bigl(\phi(x_t) - \phi(x_{t+1})\bigr)^{\mathsf T}
\]

Clearly, the role of b is to sum all the rewards observed at each state, exactly as the
vector s does in the classical technique. A, meanwhile, accumulates the statistics
(N − C). To see this, note that the outer product given above is a matrix consisting
of an entry of +1 on the single diagonal element corresponding to state x_t; an entry
of −1 on the element in row x_t, column x_{t+1}; and all the rest zeroes. Summing one
such sparse matrix for each observed transition gives A = N − C. Finally, LSTD(0)
performs the inversion β := A^{-1}b = (N − C)^{-1}s, giving the same solution as in
Equation 6.4.
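
The bookkeeping argument can be confirmed on a toy data set. The sketch below accumulates the LSTD(0) updates with one-hot features and exposes the resulting A and b for inspection; the function name and the `(x, reward, x_next)` transition format (with `None` for END) are hypothetical.

```python
import numpy as np

def lstd0_lookup_table(transitions, num_states):
    """Accumulate the LSTD(0) matrices using one independent feature per state."""
    identity = np.eye(num_states)
    A = np.zeros((num_states, num_states))
    b = np.zeros(num_states)
    for x, reward, x_next in transitions:
        phi = identity[x]
        phi_next = np.zeros(num_states) if x_next is None else identity[x_next]
        b += phi * reward                       # sums rewards per state, like s
        A += np.outer(phi, phi - phi_next)      # +1 on the diagonal, -1 at (x, x_next)
    return A, b

observed = [(0, 1.0, 1), (1, 2.0, None), (0, 1.0, 1), (1, 4.0, None)]
A0, b0 = lstd0_lookup_table(observed, num_states=2)
beta0 = np.linalg.pinv(A0) @ b0
```

For these four transitions, N = diag(2, 2) and C has a single nonzero entry c(1|0) = 2, so A0 should equal N − C and b0 should equal s.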

Thus, when λ = 0, the $A$ and $b$ matrices built by LSTD(λ) effectively record a model of all the observed transitions. What about when λ > 0? Again, $A$ and $b$ record the sufficient statistics of an empirical Markov model, but in this case the model being captured is one whose single-step transition probabilities directly encode the multi-step TD(λ) backup operations, as defined by Eq. 6.1. That is, the model links each state $x$ to all the downstream states that follow $x$ on any trajectory, and records how much influence each has on estimating $\tilde{V}^\pi(x)$ according to TD(λ). In the case of λ = 0, the TD(λ) backups correspond to the one-step transitions, resulting in the equivalence described above. The opposite extreme, the case of λ = 1, is also interesting: the empirical Markov model corresponding to TD(1)’s backups is the chain where each state $x$ leads directly to absorption (i.e., all counts are zero: $C_1 = 0$). The values $s_1(x)$ equal the sum, over all visits to $x$ on all trajectories, of all the observed rewards between $x$ and termination. LSTD(1) then solves for $\beta$ as follows:

$$\beta := A^{-1}b = (N - C_1)^{-1}s_1 = N^{-1}s_1,$$

which simply computes the average of the Monte-Carlo returns at each state $x$. In short, if we assume a lookup-table representation for the function $\tilde{V}^\pi$, we can view the LSTD(λ) algorithm as doing these two steps:

1. It implicitly uses the observed simulation data to build a Markov chain. This chain compactly models all the backups that TD(λ) would perform on the data.

2. It solves the chain by performing a matrix inversion.
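
The two steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the thesis implementation: it assumes an undiscounted absorbing chain, an eligibility vector z updated as z := λz + φ(x_t) before each transition is processed, and a terminal state with all-zero features; the function name `lstd_lambda` and the toy trajectories are hypothetical. With lookup-table features, λ = 1 reproduces the per-state average of Monte-Carlo returns, while λ = 0 solves the one-step empirical model.

```python
import numpy as np

def lstd_lambda(trajectories, features, lam):
    """Build the LSTD(lambda) statistics A and b from sampled trajectories.

    trajectories: list of lists of (state, reward, next_state) tuples, with
                  next_state None on the transition into termination.
    features:     array of shape (n_states, K); row x is the feature vector of x.
    """
    K = features.shape[1]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for traj in trajectories:
        z = np.zeros(K)  # eligibility vector, reset for each trajectory
        for x, r, x_next in traj:
            phi = features[x]
            phi_next = features[x_next] if x_next is not None else np.zeros(K)
            z = lam * z + phi
            A += np.outer(z, phi - phi_next)
            b += z * r
    return A, b

# Hypothetical data: two trajectories on a three-state absorbing chain.
trajectories = [
    [(0, -1.0, 1), (1, -2.0, 2), (2, -3.0, None)],
    [(0, -1.0, 2), (2, -1.0, None)],
]
lookup = np.eye(3)  # one independent feature per state

A1, b1 = lstd_lambda(trajectories, lookup, lam=1.0)
beta1 = np.linalg.solve(A1, b1)  # averages of Monte-Carlo returns

A0, b0 = lstd_lambda(trajectories, lookup, lam=0.0)
beta0 = np.linalg.solve(A0, b0)  # solution of the one-step empirical model
```

In this example the two settings disagree only at state 1, which was visited once: its single observed return is -5, whereas the one-step model combines its reward of -2 with the averaged value of state 2.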

The lookup-table representation for $\tilde{V}^\pi$ is intractable in practical problems; in practice, LSTD(λ) operates on states only via their (linearly dependent) feature representations $\phi(x)$. In this case, we can view LSTD(λ) as implicitly building a compressed version of the empirical model’s transition matrix $N - C$ and summed-reward vector $s$:

$$A = \Phi^{\mathsf T}(N - C)\,\Phi$$

$$b = \Phi^{\mathsf T}s,$$

where $\Phi$ is the $|X| \times K$ matrix representation of the function $\phi : X \to \Re^K$. From the compressed empirical model, LSTD(λ) computes the following coefficients for $\tilde{V}^\pi$:

$$\beta_\lambda = A^{-1}b = \bigl(\Phi^{\mathsf T}(N - C)\,\Phi\bigr)^{-1}\Phi^{\mathsf T}s \tag{6.5}$$

Ideally, these coefficients $\beta_\lambda$ would be equivalent to the empirical optimal coefficients $\beta^*_\lambda$. The empirical optimal coefficients are those that would be found by building the full uncompressed empirical model (represented by $N - C$ and $s$), using a lookup table to solve for that model’s value function ($v = (N - C)^{-1}s$), and then performing a least-squares linear fit from the state features $\Phi$ to the lookup-table value function:

$$\beta^*_\lambda = (\Phi^{\mathsf T}\Phi)^{-1}\Phi^{\mathsf T}v = (\Phi^{\mathsf T}\Phi)^{-1}\Phi^{\mathsf T}(N - C)^{-1}s. \tag{6.6}$$

It can be seen that Equations 6.5 and 6.6 are indeed equivalent for the case of λ = 1, because that setting of λ implies that $C = 0$ so $(N - C)$ is diagonal. However, for the case of λ < 1, solving the compressed empirical model does not in general produce the optimal least-squares fit to the solution of the uncompressed model.
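
A small numerical case makes the last point concrete. The sketch below is an assumption-laden illustration, not an experiment from the thesis: it uses a hypothetical three-state chain observed once (so the one-step model corresponds to λ = 0), a reward of -1 per step, and a single constant feature. It computes the compressed-model coefficients and the least-squares fit to the lookup-table solution, and the two differ.

```python
import numpy as np

# Hypothetical empirical model: chain 0 -> 1 -> 2 -> termination, seen once,
# with reward -1 on every step.
N = np.diag([1.0, 1.0, 1.0])          # visit counts
C = np.array([[0.0, 1.0, 0.0],        # observed one-step transition counts
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
s = np.array([-1.0, -1.0, -1.0])      # summed rewards per state
Phi = np.ones((3, 1))                 # one constant feature for every state

# Compressed model: solve with A = Phi^T (N - C) Phi and b = Phi^T s.
A = Phi.T @ (N - C) @ Phi
b = Phi.T @ s
beta_compressed = np.linalg.solve(A, b)

# Uncompressed model: solve the lookup table, then fit the features to it.
v = np.linalg.solve(N - C, s)
beta_fit, *_ = np.linalg.lstsq(Phi, v, rcond=None)
```

Here the lookup-table values are (-3, -2, -1), whose best constant fit is -2, while the compressed model returns -3.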

6.1.4 Empirical Comparison of TD(λ) and LSTD(λ)

I experimentally compared the performance of TD(λ) and LSTD(λ) on the simple “Hopworld” Markov chain described in Section 2.3.1. The chain consists of 13 states, as illustrated in Figure 2.2 (p. 31). We seek to represent the value function of this chain compactly, as a linear function of four state features. In fact, this domain has been contrived so that the optimal $V^\pi$ function is exactly linear in these features: the optimal coefficients $\beta^*$ are (-24, -16, -8, 0). This condition guarantees that LSTD(λ) will converge with probability 1 to the optimal $V^\pi$ for any setting of λ.

TD(λ) is also guaranteed convergence to the optimal $V^\pi$ under the additional condition that an appropriate schedule of stepsizes is chosen. As mentioned in Section 6.1.1 above, the sequence of stepsizes $(\alpha_n)$ must satisfy three criteria: $\alpha_n \ge 0 \;\forall n$; $\sum_{n=1}^{\infty} \alpha_n = \infty$; and $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$. For example, all three criteria are satisfied by schedules of the following form:

$$\alpha_n \stackrel{\mathrm{def}}{=} \alpha_0\,\frac{n_0 + 1}{n_0 + n}, \qquad n = 1, 2, \ldots \tag{6.7}$$

The parameter $\alpha_0$ determines the initial stepsize, and $n_0$ determines how gradually the stepsize decreases over time. I ran each TD(λ) experiment with six different stepsize schedules, corresponding to the six combinations of $\alpha_0 \in \{0.1, 0.01\}$ and $n_0 \in \{10^2, 10^3, 10^6\}$. These six schedules are plotted in Figure 6.1.
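
As a quick illustration of how such a schedule behaves, the sketch below evaluates a stepsize of the form α0 (n0 + 1) / (n0 + n). The function name `stepsize` and the particular parameter values are chosen here only for illustration.

```python
def stepsize(n, alpha0, n0):
    """Stepsize for trajectory number n (n = 1, 2, ...): alpha0 * (n0 + 1) / (n0 + n)."""
    if n < 1:
        raise ValueError("trajectory numbers start at 1")
    return alpha0 * (n0 + 1) / (n0 + n)

# Illustrative settings: a small n0 decays quickly, a large n0 stays nearly flat.
fast_decay = [stepsize(n, alpha0=0.1, n0=10) for n in (1, 100, 10_000)]
slow_decay = [stepsize(n, alpha0=0.1, n0=100_000) for n in (1, 100, 10_000)]
```

Both schedules start at exactly α0 on the first trajectory; a larger n0 keeps the stepsize close to α0 for longer.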


[Figure 6.1 plot omitted; x-axis: n, the trajectory number, from 0 to 10,000.]

Figure 6.1. The six different stepsize schedules used in the experiments with TD(λ). The schedules are determined by Equation 6.7 with various settings for $\alpha_0$ and $n_0$.

On the Hopworld domain, I ran both TD(λ) and LSTD(λ) for a variety of settings of λ. Figure 6.2 plots the results for the case of λ = 0.4. The plot compares the performance of six variants of TD(0.4), corresponding to the six different stepsize schedules, and LSTD(0.4). The x-axis counts the number of trajectories sampled, up to a limit of 10,000; and the y-axis measures the RMS error of the approximated value function defined by

$$\|\tilde{V}^\pi - V^\pi\| = \Bigl(\frac{1}{|X|}\sum_{x \in X}\bigl(\tilde{V}^\pi(x) - V^\pi(x)\bigr)^2\Bigr)^{1/2}. \tag{6.8}$$

Each point on the plot represents the average of 10 trials.


Figure 6.2. Performance of TD(0.4) and LSTD(0.4) on the Hopworld domain. Note the logarithmic scale of the x-axis.


The plot shows clearly that for λ = 0.4, LSTD(λ) learns a good approximation to $V^\pi$ in fewer trials than any of the TD(λ) experiments, and performs better asymptotically as well. These results held uniformly across all values of λ. Table 6.2 gives the results for λ = 0, λ = 0.4, and λ = 1. The results may be summarized as follows:

• For all values of λ, the convergence rate of LSTD(λ) exceeded that of TD(λ).

• The performance of TD(λ) depends critically on the stepsize schedule chosen. LSTD(λ) has no tunable parameters other than λ itself.

• Varying λ had no discernible effect on LSTD(λ)’s performance.

Algorithm                          Fit error after      Fit error after
                                   100 trajectories     10,000 trajectories
---------------------------------------------------------------------------
TD(0),   α0 = 0.1,  n0 = 10^6       0.49 ± 0.15          0.37 ± 0.05
TD(0),   α0 = 0.1,  n0 = 10^3       0.38 ± 0.10          0.09 ± 0.02
TD(0),   α0 = 0.1,  n0 = 10^2       0.32 ± 0.07          0.04 ± 0.02
TD(0),   α0 = 0.01, n0 = 10^6       9.08 ± 0.04          0.10 ± 0.02
TD(0),   α0 = 0.01, n0 = 10^3       9.29 ± 0.04          0.04 ± 0.02
TD(0),   α0 = 0.01, n0 = 10^2      10.66 ± 0.02          0.70 ± 0.01
LSTD(0)                             0.19 ± 0.09          0.01 ± 0.01
---------------------------------------------------------------------------
TD(0.4), α0 = 0.1,  n0 = 10^6       0.42 ± 0.15          0.40 ± 0.09
TD(0.4), α0 = 0.1,  n0 = 10^3       0.39 ± 0.16          0.12 ± 0.04
TD(0.4), α0 = 0.1,  n0 = 10^2       0.25 ± 0.08          0.03 ± 0.01
TD(0.4), α0 = 0.01, n0 = 10^6       6.73 ± 0.07          0.14 ± 0.04
TD(0.4), α0 = 0.01, n0 = 10^3       6.97 ± 0.06          0.04 ± 0.01
TD(0.4), α0 = 0.01, n0 = 10^2       8.73 ± 0.04          0.17 ± 0.01
LSTD(0.4)                           0.15 ± 0.04          0.01 ± 0.00
---------------------------------------------------------------------------
TD(1),   α0 = 0.1,  n0 = 10^6       0.73 ± 0.27          0.54 ± 0.15
TD(1),   α0 = 0.1,  n0 = 10^3       0.48 ± 0.20          0.17 ± 0.06
TD(1),   α0 = 0.1,  n0 = 10^2       0.30 ± 0.10          0.06 ± 0.02
TD(1),   α0 = 0.01, n0 = 10^6       1.86 ± 0.14          0.13 ± 0.03
TD(1),   α0 = 0.01, n0 = 10^3       2.05 ± 0.14          0.05 ± 0.02
TD(1),   α0 = 0.01, n0 = 10^2       3.31 ± 0.12          0.03 ± 0.01
LSTD(1)                             0.14 ± 0.04          0.01 ± 0.00

Table 6.2. Summary of results with TD(λ) and LSTD(λ) on the Hopworld domain for λ = 0, 0.4, and 1.0. Fit errors are measured according to Equation 6.8; the mean over 10 trials and 95% confidence interval of the mean are displayed. Results for other values of λ were similar.

Because the Hopworld domain is so small and the optimal value function is exactly linear over the available features, these results are not necessarily representative of how TD(λ) and LSTD(λ) will perform on practical problems. For example, if a domain has many features and simulation data is available cheaply, then incremental methods will have better real-time performance than least-squares methods [Sutton 92]. On the other hand, some reinforcement-learning applications have been successful with very small numbers of features (e.g., [Singh and Bertsekas 97]). STAGE’s results certainly meet this description. I investigate the performance of LSTD(λ) in the context of STAGE in the following section.

One exciting possibility for future work is to apply LSTD(λ) in the context of Markov decision problems, that is, for the purpose of approximating not $V^\pi$ but $V^*$. LSTD(λ) could provide an efficient alternative to TD(λ) in the inner loop of optimistic policy iteration [Bertsekas and Tsitsiklis 96].

6.1.5 Applying LSTD(λ) in STAGE

The application of LSTD(λ) within STAGE is straightforward. We are given a local search procedure π that is assumed to be proper, Markovian and monotonic, which implies that π induces a Markov chain over the configuration space $X$. (To use LSTD(λ) with a nonmonotonic local search procedure such as WALKSAT, the best-so-far abstraction must be applied; see Appendix A.1.) Note that the one-step rewards in the induced Markov chain are all zero except at termination (see Eq. 3.4, p. 59). This means that the update step $b := b + z_t R_t$ may be moved outside the inner loop of LSTD(λ).
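
The simplification can be sketched as follows. This is a hypothetical illustration, not STAGE's actual code: it assumes the eligibility vector is updated as z := λz + φ(x_t), that the terminal state has all-zero features, and that the single nonzero reward (named `final_reward` here) arrives on the last transition. The function name and the toy feature vectors are likewise assumptions.

```python
import numpy as np

def lstd_update_terminal_reward(feature_trajectory, final_reward, lam, A, b):
    """Update A and b in place from one trajectory whose only nonzero
    reward occurs on the final transition into termination."""
    K = A.shape[0]
    z = np.zeros(K)
    last = len(feature_trajectory) - 1
    for t, phi in enumerate(feature_trajectory):
        phi_next = feature_trajectory[t + 1] if t < last else np.zeros(K)
        z = lam * z + phi
        A += np.outer(z, phi - phi_next)
        # No update of b here: every reward before termination is zero.
    b += z * final_reward  # the single b update, moved outside the loop
    return A, b

# Toy trajectory of three states described by two features each.
features = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
A, b = lstd_update_terminal_reward(
    features, final_reward=7.0, lam=0.5, A=np.zeros((2, 2)), b=np.zeros(2)
)
```

Because all earlier rewards are zero, only the eligibility vector at the final step contributes to b, which is why one update after the loop suffices.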

The interesting empirical question concerns the best value of λ. The STAGE experiments of Chapters 4 and 5 all used supervised least-squares regression, which is equivalent to λ = 1. Is that value of λ best, since the predictions it generates are unbiased estimates of the true function? Or will a lower setting of λ enable learning in fewer trials, since LSTD(λ)’s implicit transition model enables it to treat noisy training data more informedly?

I performed experiments with LSTD(λ) on three optimization problem instances: cartogram design instance US49 and Bayes net structure-finding instances SYNTH125K and ADULT2. Other than λ, all STAGE parameters were fixed at the same settings specified earlier in Sections 4.4 and 4.6. The λ parameter was varied over six evenly-spaced settings between 0 and 1.

The experimental results, shown in Table 6.3 and the accompanying three figures, were inconclusive. On the cartogram design problem, LSTD(1) decisively outperformed all the smaller values of λ. By contrast, on SYNTH125K, all values of λ ≤ 0.8 produced significantly faster learning than LSTD(1), although Figure 6.4 shows that LSTD(1) nearly catches up by the time 100,000 moves have been considered. Finally, on instance ADULT2, there were no significant differences between any of the different settings for λ. From this set of results, the most that can be said is that values of λ < 1 sometimes help, sometimes hurt, and sometimes have no effect on the performance of STAGE in practice.


Instance                  Algorithm     Performance (N runs each)
                                        mean               best       worst
---------------------------------------------------------------------------
Cartogram                 LSTD(0.0)     0.083 ± 0.008      0.040      0.178
(US49)                    LSTD(0.2)     0.078 ± 0.009      0.040      0.183
N = 50, M = 10^6          LSTD(0.4)     0.074 ± 0.007      0.041      0.170
                          LSTD(0.6)     0.079 ± 0.010      0.038      0.204
                          LSTD(0.8)     0.076 ± 0.007      0.042      0.175
                          LSTD(1.0)     0.057 ± 0.004      0.037      0.105
---------------------------------------------------------------------------
Bayes net                 LSTD(0.0)     736241 ± 1721      720244     784874
(SYNTH125K)               LSTD(0.2)     734806 ± 1722      719431     777711
N = 200, M = 4·10^4       LSTD(0.4)     734548 ± 1674      719187     779267
                          LSTD(0.6)     736164 ± 1938      719261     796555
                          LSTD(0.8)     736094 ± 1796      719068     776308
                          LSTD(1.0)     741111 ± 1871      719748     790014
---------------------------------------------------------------------------
Bayes net                 LSTD(0.0)     440511 ± 59        439372     441052
(ADULT2)                  LSTD(0.2)     440531 ± 62        439460     441247
N = 100, M = 10^5         LSTD(0.4)     440540 ± 60        439761     441168
                          LSTD(0.6)     440490 ± 52        439767     441208
                          LSTD(0.8)     440484 ± 67        439267     441152
                          LSTD(1.0)     440461 ± 61        439715     441005

Table 6.3. STAGE performance with LSTD(λ) on three optimization instances. Each line summarizes the performance of N trials, each limited to considering M total search moves.

Figure 6.3. Cartogram performance with LSTD(λ).

[Figure 6.4 plot omitted; panel title: Performance after 40,000 moves considered.]

Figure 6.4. Performance of STAGE with LSTD(λ) on Bayes net structure-finding instance SYNTH125K. After 40,000 moves, LSTD(1.0) significantly lagged all other values for λ.

Figure 6.5. Performance of STAGE with LSTD(λ) on Bayes net structure-finding instance ADULT2.


6.2  Transfer 

This  section  concerns  a  second  significant  extension  to  STAGE:  enabling  transfer 
of  learned  knowledge  between  related  problem  instances.  I  motivate  and  describe 
an  algorithm  for  transfer,  X-STAGE,  and  present  empirical  results  on  two  problem 
families. 

6.2.1  Motivation 

There is a computational cost to training a function approximator on $V^\pi$. Learning from a π-trajectory of length $L$, with either linear regression or LSTD(λ) over $D$ features, costs STAGE $O(D^2 L + D^3)$ per iteration; quadratic regression costs $O(D^4 L + D^6)$. In the experiments of Chapter 4, these costs were minimal: typically, 0-10% of total execution time. However, STAGE’s extra overhead would become significant if many more features or more sophisticated function approximators were used. Furthermore, even if the function approximation is inexpensive, STAGE may require many trajectories to be sampled in order to obtain sufficient data to fit $\tilde{V}^\pi$ effectively.

For some problems such costs are worth it in comparison with a non-learning method, because a better or equally good solution is obtained with overall less computation. But in those cases where we use more computation, the STAGE method may nevertheless be preferable if we are then asked to solve further similar problems (e.g., a new channel routing problem with different pin assignments). Then we can hope that the computation we invested in solving the first problem will pay off in the second, and future, problems because we will already have a $\tilde{V}^\pi$ estimate. This effect is called transfer; the extent to which it occurs is largely an empirical question.

To investigate the potential for transfer, I re-ran STAGE on a suite of eight problems from the channel routing literature [Chao and Harper 96]. Table 6.4 summarizes the results and gives the coefficients of the linear evaluation function, learned independently for each problem. To make the similarities easier to see in the table, I have normalized the coefficients so that their squares sum to one; note that the search behavior of an evaluation function is invariant under positive linear transformations.


Problem      lower     best-of-3        best-of-3     learned coefficients
instance     bound     hillclimbing     STAGE         <w, p, U>
--------------------------------------------------------------------------
YK4          10        22               12            <0.71, 0.05, -0.70>
HYC1          8         8                8            <0.52, 0.83, -0.19>
HYC2          9         9                9            <0.71, 0.21, -0.67>
HYC3         11        12               12            <0.72, 0.30, -0.62>
HYC4         20        27               23            <0.71, 0.03, -0.71>
HYC5         35        39               38            <0.69, 0.14, -0.71>
HYC6         50        56               51            <0.70, 0.05, -0.71>
HYC7         39        54               42            <0.71, 0.13, -0.69>
HYC8         21        29               25            <0.71, 0.03, -0.70>

Table 6.4. STAGE results on eight problems from [Chao and Harper 96]. The coefficients have been normalized so that their squares sum to one.
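
The normalization used for the table, and the invariance that justifies it, can be illustrated with a short sketch. Everything named here is hypothetical: the raw coefficient vector, the feature vectors of the candidate moves, and the helper functions `normalize` and `accepts`. The sketch also assumes that hillclimbing accepts a move exactly when the predicted value strictly decreases.

```python
import numpy as np

def normalize(coeffs):
    """Rescale a coefficient vector so that its squares sum to one."""
    c = np.asarray(coeffs, dtype=float)
    return c / np.linalg.norm(c)

def accepts(coeffs, intercept, f_from, f_to):
    """Hillclimbing on a linear evaluation function: accept the move
    when the predicted value of the new state is strictly lower."""
    def value(f):
        return intercept + float(np.dot(coeffs, f))
    return value(f_to) < value(f_from)

raw = np.array([7.1, 0.5, -7.0])  # hypothetical unnormalized coefficients
unit = normalize(raw)

# Hypothetical (from, to) feature vectors for three candidate moves.
moves = [
    ((0.2, 0.5, 0.9), (0.6, 0.5, 0.1)),
    ((0.8, 0.1, 0.2), (0.3, 0.1, 0.7)),
    ((0.5, 0.9, 0.5), (0.5, 0.1, 0.5)),
]
decisions_raw = [accepts(raw, 0.0, a, b) for a, b in moves]
decisions_unit = [accepts(unit, 0.0, a, b) for a, b in moves]
decisions_affine = [accepts(5.0 * raw, 100.0, a, b) for a, b in moves]
```

All three versions of the evaluation function make the same accept/reject decisions, so rescaling the coefficients for display does not change the search they would produce.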


The similarities among the learned evaluation functions are striking. Except on the trivially small instance HYC1, all the STAGE-learned functions assign a relatively large positive weight to feature $w(x)$, a similarly large negative weight to feature $U(x)$, and a small positive weight to feature $p(x)$. In Section 5.1.3 (see Eq. 5.1, page 116), I explained the coefficients found on instance YK4 as follows: STAGE has learned that good hillclimbing performance is predicted by an uneven distribution of track fullness levels ($w(x) - U(x)$) and by a low analytical bound on the effect of further merging of tracks ($p(x)$). From Table 6.4, we can conclude that this explanation holds generally for many channel routing instances. Thus, transfer between instances should be fruitful.

6.2.2 X-STAGE: A Voting Algorithm for Transfer

Many sensible methods for transferring the knowledge learned by STAGE from “training” problem instances to new instances can be imagined. This section presents one such method. STAGE’s learned knowledge, of course, is represented by the approximated value function $\tilde{V}^\pi$. We would like to take the $\tilde{V}^\pi$ information learned on a set of training instances $\{I_1, I_2, \ldots, I_N\}$ and use it to guide search on a given new instance $I'$. But how can we ensure that $\tilde{V}^\pi$ is meaningful across multiple problem instances simultaneously, when the various instances may differ markedly in size, shape, and attainable objective-function value?

The first crucial step is to impose an instance-independent representation on the features $F(x)$, which comprise the input to $\tilde{V}^\pi(F(x))$. In their algorithm for transfer between job-shop scheduling instances (which I will discuss in detail in Section 7.2), Zhang and Dietterich recognize this: they define “a fixed set of summary statistics describing each state, and use these statistics as inputs to the function approximator” [Zhang and Dietterich 98]. As it so happens, almost all the feature sets used with STAGE in this thesis are naturally instance-independent, or can easily be made so by normalization. For example, in Bayes-net structure-finding problems (§4.4), the feature that counts the number of “orphan” nodes can be made instance-independent simply by changing it to the percentage of total nodes that are orphans.

The second question concerns normalization of the outputs of $\tilde{V}^\pi(F(x))$, which are predictions of objective-function values. In Table 6.4 above, the nine channel routing instances all have quite different solution qualities, ranging from 8 tracks in the case of instance HYC1 to more than 50 tracks in the case of instance HYC6. If we wish to train a single function approximator to make meaningful predictions about the expected solution quality on both instances HYC1 and HYC6, then we must normalize the objective function itself. For example, $\tilde{V}^\pi$ could be trained to predict not the expected reachable Obj value, but the expected reachable percentage above a known lower bound for each instance. Zhang and Dietterich adopt this approach: they heuristically normalize each instance’s final job-shop schedule length by dividing it by the difficulty level of the starting state [Zhang and Dietterich 98]. This enables them to train a single neural network over all problem instances.

However, if tight lower bounds are not available, such normalization can be problematic. Consider the following concrete example:

• There are two similar instances, $I_1$ and $I_2$, which both have the same true optimal solution, say, Obj($x^*$) = 130.

• A single set of features $f$ is equally good in both instances, promising to lead search to a solution of quality 132 on either.

• The only available lower bounds for the two instances are $b_1 = 110$ and $b_2 = 120$, respectively.

In this example, normalizing the objective functions to report a percentage above the available lower bound would result in a target value of 20% for $\tilde{V}^\pi(f)$ on $I_1$ and a target value of 10% for $\tilde{V}^\pi(f)$ on $I_2$. At best, these disparate training values add noise to the training set for $\tilde{V}^\pi$. At worst, they could interact with other inaccurate training set values and make the non-instance-specific $\tilde{V}^\pi$ function useless for guiding search on new instances.
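
The arithmetic behind the two target values is simple enough to write out. In the sketch below the helper name `percent_above_bound` is an assumption introduced for illustration; the numbers are those of the example.

```python
def percent_above_bound(obj_value, lower_bound):
    """Objective value expressed as a percentage above a lower bound."""
    return 100.0 * (obj_value - lower_bound) / lower_bound

# Same reachable solution quality (132), different available lower bounds.
target_on_first = percent_above_bound(132, 110)   # (132 - 110) / 110
target_on_second = percent_above_bound(132, 120)  # (132 - 120) / 120
```

Identical features and identical reachable quality thus produce two different regression targets, purely because the bounds differ in tightness.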

I adopt here a different approach, which eliminates the need to normalize the objective function across instances. The essential idea is to recognize that each individually learned $\tilde{V}^\pi$ function, unnormalized, is already suitable for guiding search on the new problem $I'$: the search behavior of an evaluation function is scale- and translation-invariant. The X-STAGE algorithm, specified in Table 6.5, combines the knowledge of multiple $\tilde{V}^\pi$ functions not by merging them into a single new evaluation function, but by having them vote on move decisions for the new problem $I'$. Note that after the initial set of value functions has been trained, X-STAGE performs no further learning when given a new optimization problem $I'$ to solve.

Combining decisions by voting rather than, say, averaging, ensures that each training instance carries equal weight in the decision-making process, regardless of the range of that instance’s objective function. Voting is also robust to “outlier” functions, such as the one learned on instance HYC1 in Table 6.4 above. Such a function’s move recommendations will simply be outvoted. A drawback to the voting scheme is that, in theory, loops are possible in which a majority prefers $x$ over $x'$, $x'$ over $x''$, and $x''$ over $x$. However, I have not seen such a loop in practice, and if one did occur, the patience counter Pat would at least prevent X-STAGE from getting permanently stuck.
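
Both properties of voting discussed above, robustness to an outlier and the theoretical possibility of a preference loop, can be seen in a small sketch. This is an illustration under assumptions, not the thesis implementation: the helper `majority_accepts`, the scales and offsets of the three linear functions, the feature vectors, and the tabulated values in the loop example are all hypothetical, and a strict inequality is assumed for each vote.

```python
def majority_accepts(value_fns, current, proposed):
    """Accept a move when a strict majority of the value functions
    rate the proposed state strictly lower than the current one."""
    votes = sum(1 for v in value_fns if v(proposed) < v(current))
    return 2 * votes > len(value_fns)

def linear(coeffs, scale=1.0, offset=0.0):
    """Linear evaluation function with an arbitrary positive scale and offset."""
    return lambda f: offset + scale * sum(c * x for c, x in zip(coeffs, f))

# Two similar functions on very different output scales, plus one outlier.
voters = [
    linear((0.71, 0.05, -0.70), scale=100.0, offset=40.0),
    linear((0.71, 0.03, -0.71)),
    linear((0.52, 0.83, -0.19), scale=10.0),
]
current, proposed = (0.5, 0.2, 0.5), (0.4, 0.6, 0.6)
outlier_outvoted = majority_accepts(voters, current, proposed)

# A preference loop: each voter ranks three states differently (lower is better).
tables = [
    {"x": 1, "x'": 2, "x''": 3},
    {"x'": 1, "x''": 2, "x": 3},
    {"x''": 1, "x": 2, "x'": 3},
]
cyclic_voters = [table.__getitem__ for table in tables]
loop = [
    majority_accepts(cyclic_voters, "x'", "x"),    # majority prefers x over x'
    majority_accepts(cyclic_voters, "x''", "x'"),  # majority prefers x' over x''
    majority_accepts(cyclic_voters, "x", "x''"),   # majority prefers x'' over x
]
```

In the first example the outlier votes against the move but is outvoted, and the differing scales of the other two functions play no role. In the second, every one of the three moves around the cycle is accepted.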

X-STAGE($I_1, I_2, \ldots, I_N, I'$):

Given:

• a set of training problem instances $\{I_1, \ldots, I_N\}$ and a test instance $I'$. Each instance has its own objective function and all other STAGE parameters (see p. 52). It is assumed that each instance’s featurizer $F : X \to \Re^D$ maps states to the same number $D$ of real-valued features.

1. Run STAGE independently on each of the $N$ training instances. This produces a set of learned value functions $\{\tilde{V}^\pi_{I_1}, \tilde{V}^\pi_{I_2}, \ldots, \tilde{V}^\pi_{I_N}\}$.

2. Run STAGE on the new instance $I'$, but with STAGE’s Step 2c (the step that searches for a promising new starting state for π; see p. 52) modified as follows: instead of performing hillclimbing on a newly learned $\tilde{V}^\pi$, perform voting-hillclimbing on the set of previously learned $\tilde{V}^\pi$ functions. Voting-hillclimbing means simply:

Accept a proposed move from state $x$ to state $x'$ if and only if, for a majority of the learned value functions, $\tilde{V}^\pi_{I_j}(F(x')) < \tilde{V}^\pi_{I_j}(F(x))$.

Return the best state found.

Table 6.5. The X-STAGE algorithm for transferring learned knowledge to a new optimization instance.

6.2.3 Experiments

I applied the X-STAGE algorithm to the domains of bin-packing and channel routing. For the bin-packing experiment, I gathered a set of 20 instances from the OR-Library, the same 20 instances studied in Section 4.2. Using the same STAGE parameters given in that section (p. 75), I trained $\tilde{V}^\pi$ functions for all of the 20 except u250_13, and then applied X-STAGE to test performance on the held-out instance. The performance curves of X-STAGE and, for comparison, ordinary STAGE are shown in Figure 6.6. The semilog scale of the plot clearly shows that X-STAGE reaches good performance levels more quickly than STAGE. However, after only about 10 learning iterations and 10,000 evaluations, the average performance of STAGE exceeds that of X-STAGE. STAGE’s $\tilde{V}^\pi$ function, finely tuned for the particular instance under consideration, ultimately outperforms the voting-based restart policy generated from 19 related instances.


[Figure 6.6 plot omitted; x-axis: number of moves considered, logarithmic scale from 1000 to 100000.]

Figure 6.6. Bin-packing performance on instance u250_13 with transfer (X-STAGE) and without transfer (STAGE). Note the logarithmic scale of the x-axis.


The channel routing experiment was conducted with the set of 9 instances shown in Table 6.4 above. Again, all STAGE parameters were set as in the experiments of Chapter 4 (see p. 81). I trained $\tilde{V}^\pi$ functions for the instances HYC1 ... HYC8, and applied X-STAGE to test performance on instance YK4. The performance curves of X-STAGE and ordinary STAGE are shown in Figure 6.7. Again, X-STAGE reaches good performance levels more quickly than does STAGE. This time, the voting-based restart policy maintains its superiority over the instance-specific learned policy for the duration of the run.

Figure 6.7. Channel routing performance on instance YK4 with transfer (X-STAGE) and without transfer (STAGE). Note the logarithmic scale of the x-axis.

These preliminary experiments indicate that the knowledge STAGE learns during problem-solving can indeed be profitably transferred to novel problem instances. Future work will consider ways of combining previously learned knowledge with new knowledge learned during a run, so as to have the best of both worlds: exploiting general knowledge about a family of instances to reach good solutions quickly, and exploiting instance-specific knowledge to reach the best possible solutions.

6.3  Discussion 

This chapter has presented two significant extensions to STAGE: the Least-Squares TD(λ) algorithm for efficient reinforcement learning of $\tilde{V}^\pi$; and the X-STAGE algorithm for transferring $\tilde{V}^\pi$ functions learned on a set of problem instances to new, similar instances. Many further interesting extensions to STAGE are possible. After giving a survey of related work in Chapter 7, I will present a number of as-yet unexplored ideas for STAGE in the concluding chapter.




Chapter  7 

Related  Work 


Heuristic methods for global optimization have assuredly not been overlooked in the literature. Their practical applications are important and numerous, and the literature is correspondingly immense. There is also a significant, if not quite as overwhelming, literature on learning evaluation functions for heuristic search in game-playing and problem-solving. In this chapter, I review the prior work most relevant to STAGE from both these literatures. The reviewed topics are organized as follows:

• §7.1: adaptive multi-restart techniques for local search;

• §7.2: reinforcement learning for combinatorial optimization, especially the study of Zhang and Dietterich [95];

• §7.3: simulation-based methods for improving AI search in game-playing and problem-solving domains, including techniques based on “rollouts” and on learning evaluation functions; and

• §7.4: genetic algorithms.

I defer a high-level discussion of STAGE’s novel contributions to the next, concluding chapter.

7.1  Adaptive  Multi-Restart  Techniques 

An iteration of hillclimbing typically reaches a local optimum very quickly. Thus, in the time required to perform a single iteration of (say) simulated annealing, one can run many hillclimbing iterations from different random starting points (or even from the same starting point, if move operators are sampled stochastically) and report the best result. Empirically, random multi-start hillclimbing has produced excellent solutions on practical computer vision tasks [Beveridge et al. 96], outperformed simulated annealing on the Traveling Salesman Problem (TSP) [Johnson and McGeoch 95], and outperformed genetic algorithms and genetic programming on several large-scale testbeds [Juels and Wattenberg 96].

Nevertheless, the effectiveness of random multi-start local search is limited in many cases by a “central limit catastrophe” [Boese et al. 94]: random local optima in large problems tend to all have average quality, with little variance [Martin and Otto 94]. This means the chance of finding an improved solution diminishes quickly from one iteration to the next. To improve on these chances, an adaptive multi-start approach, designed to select restart states with better-than-average odds of finding an improved solution, seems appropriate. Indeed, in the theoretical model of local search proposed by Aldous and Vazirani [94], a given performance level that takes O(n) restarts to reach by a random starting policy can instead be reached with as few as O(log n) restarts when an adaptive policy, which uses successful early runs to seed later starting states, is used.

Many  adaptive  multi-start  techniques  have  been  proposed.  One  particularly  rel¬ 
evant  study  has  recently  been  conducted  by  Boese  [95,96].  On  a  fixed,  well-known 
instance  of  the  TSP,  he  ran  local  search  2500  times  to  produce  2500  locally  optimal 
solutions.  Then,  for  each  of  those  solutions,  he  computed  the  average  distance  to  the 
other  2499  solutions,  measured  by  a  natural  distance  metric  on  TSP  tours.  The  results 
showed  a  stunning  correlation  between  solution  quality  and  average  distance:  high- 
quality  local  optima  tended  to  have  small  average  distance  to  the  other  optima — they 
were  “centrally”  located — while  worse  local  optima  tended  to  have  greater  average 
distance  to  the  others;  they  were  at  the  “outskirts”  of  the  space.  Similar  correlations 
were found in a variety of other optimization domains, including circuit/graph partitioning,
satisfiability, number partitioning, and job-shop scheduling. Boese concluded
that  many  practical  optimization  problems  exhibit  a  “globally  convex”  or  so-called 
“big  valley”  structure,  in  which  the  set  of  local  optima  appears  convex  with  one  central 
global  optimum.  Boese’s  intuitive  diagram  of  the  big  valley  structure  is  reproduced 
in  Figure  7.1. 
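
Boese's measurement can be sketched in a few lines. The Python below is only an illustration under stated assumptions: the `distance` argument stands in for the tour metric (a Hamming distance on toy bitstrings is used purely as a placeholder), and the six "local optima" and their costs are invented for the example rather than taken from Boese's data.

```python
import math

def mean_distance_to_others(solutions, distance):
    """For each solution, average its distance to every other solution."""
    n = len(solutions)
    means = []
    for i, s in enumerate(solutions):
        total = sum(distance(s, t) for j, t in enumerate(solutions) if j != i)
        means.append(total / (n - 1))
    return means

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def hamming(a, b):
    """Placeholder metric (assumption): number of differing positions."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical local optima and costs, chosen so that cost grows with
# distance from a central solution (a caricature of a "big valley").
optima = ["000000", "100000", "010000", "001100", "000011", "111111"]
costs = [0, 1, 1, 2, 2, 6]

mean_dists = mean_distance_to_others(optima, hamming)
correlation = pearson(costs, mean_dists)
```

A strongly positive `correlation` is what the big valley hypothesis predicts: the cheapest optima sit closest, on average, to all the others.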

The  big  valley  structure  is  auspicious  for  a  STAGE-like  approach.  Indeed,  Boese’s 
intuitive  diagram,  motivated  by  his  experiments  on  large-scale  complex  problems, 
bears  a  striking  resemblance  to  the  1-D  wave  function  of  Figure  3.4  (p.  48),  which 
I  contrived  as  an  example  of  the  kind  of  problem  at  which  STAGE  would  excel. 
Working  from  the  assumption  of  the  big  valley  structure,  Boese  recommended  the 
following  two-phase  adaptive  multi-start  methodology  for  optimization: 

Phase One: Generate R random starting solutions and run Greedy_Descent
from each to determine a set of corresponding random local minima.

Phase Two: Based on the local minima obtained so far, construct adaptive
starting solutions and run Greedy_Descent A times from each one
to yield corresponding adaptive local minima.

Figure 7.1. Intuitive picture of the “big valley” solution space structure. (Adapted
from  [Boese  95].) 


Intuitively,  the  two  phases  respectively  develop,  then  exploit,  a  structural 
picture  of  the  cost  surface.  [Boese  et  al.  94] 

At  this  level  of  description,  the  two  phases  of  Boese’s  recommended  search  regime 
correspond  almost  exactly  with  the  two  alternating  phases  of  STAGE.  The  main 
difference  is  that  Boese  hand-builds  a  problem-specific  routine  for  adaptively  con¬ 
structing  new  starting  states,  whereas  STAGE  uses  machine  learning  to  do  the  same 
automatically. Another difference is that STAGE’s learned heuristic for constructing
starting  states  is  based  on  full  hillclimbing  trajectories,  not  just  the  local  minima 
obtained  so  far. 

A  similar  methodology  underlies  the  current  best  heuristic  for  solving  large  Travel¬ 
ing  Salesman  problems,  “Chained  Local  Optimization”  (CLO)  [Martin  and  Otto  94]. 
CLO performs ordinary hillclimbing to reach a local optimum z, and then applies
a special large-step stochastic operator designed to “kick” the search from z into a
nearby but different attracting basin. Hillclimbing from this new starting point produces
a new local optimum z'; if this turns out to be much poorer than z, then CLO
returns to z, undoing the kick. In effect, CLO constructs a new high-level search space:
the  new  operators  consist  of  large-step  kick  moves,  and  the  new  objective  function 
is  calculated  by  first  applying  hillclimbing  in  the  low-level  space,  then  evaluating  the 
resulting  local  optimum.  (A  similar  trick  is  often  applied  with  genetic  algorithms,  as 
I  discuss  in  Section  7.4  below.)  In  the  TSP,  the  kick  designed  by  Martin  and  Otto  [94] 
is  a  so-called  “double-bridge”  operation,  chosen  because  such  moves  cannot  be  easily 
found  nor  easily  undone  by  Lin-Kernighan  local  search  moves  [Johnson  and  McGeoch 
95].  Like  Boese’s  adaptive  multi-start,  CLO  relies  on  manually  designed  kick  steps 
for finding a good new starting state, as opposed to STAGE’s learned restart policy.
Furthermore, STAGE’s “kicks” place search in not just a random nearby basin, but
one  specifically  predicted  to  produce  an  improved  local  optimum. 
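
The kick-and-undo loop of CLO described above might be sketched as follows. This is a minimal illustration, not Martin and Otto's implementation: the integer landscape `obj`, the one-step `hillclimb`, the random-jump `kick`, and the `tolerance` parameter (standing in for "much poorer") are all assumptions made for the example.

```python
import random

def chained_local_optimization(x0, obj, hillclimb, kick, n_kicks,
                               tolerance=0.0, rng=None):
    """Alternate large-step kicks with hillclimbing; undo kicks that hurt."""
    rng = rng or random.Random(0)
    z = hillclimb(x0)          # current local optimum
    best = z
    for _ in range(n_kicks):
        z_new = hillclimb(kick(z, rng))
        if obj(z_new) <= obj(z) + tolerance:
            z = z_new          # accept the new basin
        # otherwise stay at z, which undoes the kick
        if obj(z) < obj(best):
            best = z
    return best

# Toy landscape (assumption): every multiple of 7 is a local minimum
# under the +/-1 neighborhood.
def obj(x):
    return abs(x - 40) + 5 * (x % 7)

def hillclimb(x):
    """Steepest descent over the two neighbors x-1 and x+1."""
    while True:
        nb = min((x - 1, x + 1), key=obj)
        if obj(nb) >= obj(x):
            return x
        x = nb

def kick(z, rng):
    """Hypothetical large-step operator: jump 3 to 10 units left or right."""
    return z + rng.choice([-1, 1]) * rng.randint(3, 10)

best = chained_local_optimization(0, obj, hillclimb, kick, n_kicks=50,
                                  rng=random.Random(1))
```

Because every accepted state is the output of `hillclimb`, the loop effectively searches over local optima only, which is the "high-level search space" view given in the text.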

The  big  valley  diagram,  like  my  1-D  wave  function,  conveys  the  notion  of  a  global 
structure  over  the  local  optima.  Unlike  the  wave  function,  it  also  conveys  one  poten¬ 
tially  misleading  intuition:  that  starting  from  low-cost  solutions  is  necessarily  better 
than  starting  from  high-cost  solutions.  In  his  survey  of  local  search  techniques  for  the 
TSP,  Johnson  [95]  considered  four  different  randomized  heuristics  for  constructing 
starting  tours  from  which  to  begin  local  search.  He  found  significant  differences  in 
the quality of final solutions. Interestingly, the heuristic that constructed the best-quality
starting tours (namely, the “Clarke-Wright” heuristic) was also the one that
led  search  to  the  worst-quality  final  solutions — even  worse  than  starting  from  a  very 
poor,  completely  random  tour.  Such  “deceptiveness”  can  cause  trouble  for  simulated 
annealing  and  genetic  algorithms.  Large-step  methods  such  as  CLO  may  evade  some 
such  deceits  by  “stepping  over”  high-cost  regions.  STAGE  confronts  the  deceit  head- 
on:  it  explicitly  detects  when  features  other  than  the  objective  function  are  better 
predictors  of  final  solution  quality,  and  can  learn  to  ignore  the  objective  function 
altogether  when  searching  for  a  good  start  state. 

Many  other  sensible  heuristics  for  adaptive  restarting  have  been  shown  effective 
in  the  literature.  The  widely  applied  methodology  of  “tabu  search”  [Glover  and 
Laguna  93]  is  fundamentally  a  set  of  adaptive  heuristics  for  escaping  local  optima, 
like  CLO’s  kick  steps.  Hagen  and  Kahng’s  “Clustered  Adaptive  Multi-Start”  achieves 
excellent  results  on  the  VLSI  netlist  partitioning  task  [Hagen  and  Kahng  97];  like 
CLO,  it  alternates  between  search  with  high-level  operators  (constructed  adaptively 
by  clustering  elements  of  previous  good  solutions)  and  ordinary  local  search.  Jagota’s 
“Stochastic  Steep  Descent  with  Reinforcement  Learning”  heuristically  rewards  good 
starting  states  and  punishes  poor  starting  states  in  a  multi-start  hillclimbing  context 
[Jagota  et  al.  96,  Jagota  96].  The  precise  reward  mechanism  is  heuristically  determined 
and  appears  to  be  quite  problem-specific,  as  opposed  to  STAGE’s  uniform  mechanism 
of  predicting  search  outcomes  by  value  function  approximation.  As  such,  a  direct 
empirical  comparison  would  be  difficult. 

7.2  Reinforcement  Learning  for  Optimization 

Value  function  approximation  has  previously  been  applied  to  a  large-scale  combina¬ 
torial  optimization  task:  the  Space  Shuttle  Payload  Processing  domain  [Zhang  and 
Dietterich  95,  Zhang  96].  I  will  discuss  this  application  in  some  detail,  because  both 
the  similarities  and  differences  to  STAGE  are  instructive.  Please  refer  back  to  Sec¬ 
tion  2.1  for  details  on  the  algorithms  and  notation  of  value  function  approximation. 



The Space Shuttle Payload Processing (SSPP) domain is a form of job-shop
scheduling: given a partially ordered set of jobs and the resources required by each,
assign  them  start  times  so  as  to  respect  the  partial  order,  meet  constraints  on  si¬ 
multaneous  resource  usage,  and  minimize  the  total  execution  time.  Following  the 
“repair-based  scheduling”  paradigm  of  Zweben  [94],  Zhang  and  Dietterich  defined  a 
search  over  the  space  of  fully  specified  schedules  that  meet  the  ordering  constraints 
but  not  necessarily  the  resource  constraints.  Search  begins  at  a  fixed  start  state,  the 
“critical path schedule,” and proceeds by the application of two types of deterministic
operators: REASSIGN-POOL and MOVE. To keep the number of instantiated
operators  small,  they  considered  only  operations  which  would  repair  the  schedule’s 
earliest  constraint  violation.  Search  terminates  as  soon  as  a  violation-free  schedule  is 
reached. 

Zhang  and  Dietterich  applied  reinforcement  learning  to  this  domain  in  order  to 
obtain  transfer:  their  goal  was  to  learn  a  value  function  which  captured  knowledge 
about  not  a  single  instance  of  a  scheduling  problem,  but  rather  a  large  family  of 
related  scheduling  instances.  Thus,  the  input  to  their  value  function  approximator 
(a  neural  network)  consisted  of  abstract  instance-independent  features,  such  as  the 
percentage  of  the  schedule  containing  a  violation  and  the  mean  and  standard  deviation 
of  certain  slack  times.  Likewise,  the  ultimate  measure  of  schedule  quality  which  the 
neural  network  was  learning  to  predict  had  to  be  normalized  so  that  it  spanned  the 
same  range  regardless  of  problem  difficulty.  (The  voting-based  approach  to  transfer 
introduced  in  Section  6.2.2  of  this  thesis  allows  such  normalization  to  be  avoided.) 
Trained  on  many  scheduling  instances,  the  neural  network  could  then  be  applied  to 
guide  search  on  a  novel  scheduling  instance,  hopefully  producing  a  good  solution 
quickly. 

Unlike STAGE, which seeks to learn V^π (the predicted outcome of a prespecified
optimization policy π), the Zhang and Dietterich approach seeks to learn V*, the
predicted  outcome  of  the  best  possible  optimization  policy.  Before  discussing  how  they 
approached  this  ambitious  goal,  we  must  address  how  the  “best  possible  optimization 
policy”  is  even  defined,  since  optimization  policies  face  two  conflicting  objectives: 
“produce  good  solutions”  and  “finish  quickly.”  In  the  SSPP  domain,  Zhang  and 
Dietterich measured the total cost of a search trajectory (x_0, x_1, ..., x_N, END) by

    Obj(x_N) + 0.001 N

Effectively,  since  Obj(x)  is  near  1.0  in  this  problem,  this  cost  function  means  that  a 
1%  improvement  in  final  solution  quality  is  worth  about  10  extra  search  steps  [Zhang 
and Dietterich 98]. The goal of learning, then, was to produce a policy π* to optimize
this  balance  between  trajectory  length  and  final  solution  quality. 
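
The quoted exchange rate can be checked directly, assuming a per-step penalty of 0.001 and a final objective value near 1.0 as stated above:

```latex
\[
\Delta\mathrm{Obj} \approx 0.01 \times 1.0 = 0.01,
\qquad
\frac{0.01}{0.001 \text{ per step}} = 10 \text{ steps}.
\]
```

That is, a trajectory may grow by roughly ten steps before the added length penalty cancels a 1% gain in final solution quality.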

Following Tesauro’s methodology for learning V* on backgammon, Zhang and
Dietterich applied optimistic TD(λ) to the SSPP domain. Rather than training only a
single  neural  network,  they  trained  a  pool  of  8-12  networks,  using  different  parameter 
settings  for  each.  They  trained  the  networks  on  a  set  of  small  problem  instances. 
Training  continued  until  performance  stopped  improving  on  a  validation  set  of  other 
problem  instances.  Then,  the  N  best-performing  networks  were  saved  and  used  in  a 
round-robin  fashion  for  comparisons  against  Zweben’s  iterative-repair  system  [Zweben 
et  al.  94],  the  previously  best  scheduler.  The  results  showed  that  searches  with  the 
learned  evaluation  functions  produced  schedules  as  good  as  Zweben’s  in  less  than  half 
the  CPU  time. 

Getting  these  good  results  required  substantial  tuning.  One  complication  involves 
state-space  cycles.  Since  the  move  operators  are  deterministic,  a  learned  policy  may 
easily  enter  an  infinite  loop,  which  makes  its  value  function  undefined.  Loops  are  fairly 
infrequent  in  the  SSPP  domain  because  most  operators  repair  constraint  violations, 
lengthening  the  schedule;  still,  Zhang  and  Dietterich  had  to  include  a  loop-detection 
and  escape  mechanism,  clouding  the  interpretation  of  V*.  To  attack  other  combina¬ 
torial  optimization  domains  with  their  method,  they  suggest  that  “it  is  important  to 
formulate  problem  spaces  so  that  they  are  acyclic”  [Zhang  and  Dietterich  98] — but 
such  formulations  are  unnatural  for  most  local  search  applications,  in  which  the  op¬ 
erators  typically  allow  any  solution  to  be  reached  from  any  other  solution.  STAGE 
finesses this issue by fixing π to be a proper policy such as hillclimbing.

STAGE  also  manages  to  finesse  three  other  algorithmic  complications  which  Zhang 
and  Dietterich  found  it  necessary  to  introduce:  experience  replay  [Lin  93],  random 
exploration  (slowly  decreasing  over  time),  and  random-sample  greedy  search.  Expe¬ 
rience  replay,  i.e.,  saving  the  best  trajectories  in  memory  and  occasionally  retraining 
on  them,  is  unnecessary  in  STAGE  because  the  regression  matrices  always  maintain 
the  sufficient  statistics  of  all  historical  training  data.  Adding  random  exploration 
is unnecessary because empirically, STAGE’s baseline policy π (e.g., stochastic hillclimbing
or WALKSAT) provides enough exploration inherently. This is in contrast
to  the  SSPP  formulation,  where  actions  are  deterministic.  Finally,  STAGE  does 
not  face  the  branching  factor  problem  which  led  Zhang  and  Dietterich  to  introduce 
random-sample  greedy  search  (RSGS).  Briefly,  the  problem  is  that  when  hundreds  or 
thousands  of  legal  operators  are  available,  selecting  the  greedy  action,  as  optimistic 
TD(λ) requires, is too costly. RSGS uses a heuristic to select an approximately greedy
move  from  a  subset  of  the  available  moves.  Again,  this  clouds  the  interpretation  of 
V*.  In  STAGE,  each  decision  is  simply  whether  to  accept  or  reject  a  single  available 
move, and the interpretation of V^π is clear.
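
The remark that regression matrices make experience replay unnecessary can be illustrated with a small sketch. For linear least squares it suffices to accumulate the sums of φφᵀ and φy over all training pairs; the raw trajectories need not be stored. The class name, the bias feature, and the small ridge term below are assumptions of this example, not details taken from STAGE.

```python
class LinearRegressionStats:
    """Accumulates sufficient statistics for linear least squares."""

    def __init__(self, n_features, ridge=1e-6):
        k = n_features + 1  # one extra slot for a constant bias feature
        # Small ridge term (assumption) keeps the system solvable early on.
        self.A = [[ridge if i == j else 0.0 for j in range(k)]
                  for i in range(k)]
        self.b = [0.0] * k

    def add(self, features, outcome):
        """Fold one (features, outcome) pair into the running sums."""
        phi = list(features) + [1.0]
        for i, p_i in enumerate(phi):
            self.b[i] += p_i * outcome
            for j, p_j in enumerate(phi):
                self.A[i][j] += p_i * p_j

    def solve(self):
        """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
        k = len(self.b)
        M = [row[:] + [rhs] for row, rhs in zip(self.A, self.b)]
        for col in range(k):
            pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
            M[col], M[pivot] = M[pivot], M[col]
            for r in range(k):
                if r != col:
                    f = M[r][col] / M[col][col]
                    for c in range(col, k + 1):
                        M[r][c] -= f * M[col][c]
        return [M[i][k] / M[i][i] for i in range(k)]

    def predict(self, features):
        w = self.solve()
        return sum(w_i * x_i for w_i, x_i in zip(w, list(features) + [1.0]))

# Toy usage: outcomes generated by y = 2x + 1.
stats = LinearRegressionStats(n_features=1)
for x in range(5):
    stats.add([float(x)], 2.0 * x + 1.0)
weights = stats.solve()  # [slope, bias]
```

Each call to `add` costs time quadratic in the number of features, independent of how much data has already been seen.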

To summarize, STAGE avoids most of the algorithmic complexities of Zhang and
Dietterich’s method because it is solving a fundamentally simpler problem: estimating
V^π from a fixed stochastic π, rather than discovering an optimal deterministic
policy π* and value function V*. It also avoids many issues of normalizing problem
instances  and  designing  training  architectures  by  virtue  of  the  fact  that  it  applies  in 
the  context  of  a  single  problem  instance.  However,  an  advantage  of  the  Zhang  and 
Dietterich  approach  is  that  it  holds  out  the  potential  of  identifying  a  truly  optimal 
or near-optimal policy π*. STAGE can only claim to learn an improvement over the
prespecified policy π.

Another published work, the “Ant-Q” system [Dorigo and Gambardella 95], is also
billed  as  a  reinforcement  learning  approach  to  combinatorial  optimization.  Based  on 
an  extensive  metaphor  with  the  behavior  of  ant  colonies,  it  has  been  applied  only  to 
the  TSP;  it  is  unclear  how  it  would  be  applied  to  other  optimization  domains. 

7.3  Rollouts  and  Learning  for  AI  Search 

Ordinary  hillclimbing,  simulated  annealing  and  genetic  algorithms  evaluate  the  qual¬ 
ity  of  each  neighboring  state  x'  by  its  objective  function  value  Obj(x').  STAGE 
evaluates x' by how promising its features make it appear, Ṽ^π(F(x')). A third possibility
is to evaluate x' by performing one or more actual sample runs of hillclimbing
starting  from  x'.  This  is  the  principle  of  so-called  “rollout  algorithms”  [Bertsekas  et 
al.  97,  Tesauro  and  Galperin  97],  a  term  borrowed  from  the  backgammon-strategy 
literature  [Woolsey  91].  Like  STAGE,  rollout  methods  evaluate  each  neighbor  based 
upon its long-term promise as a starting state for a fixed policy π. Selecting actions
in this way is tantamount to performing a single round of the policy iteration
algorithm [Howard 60] and can be guaranteed to improve upon π [Bertsekas et al. 97].
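
A rollout-based move selection might look like the sketch below. It is illustrative only: the toy landscape, the `neighbors` function, and the deterministic hillclimber used as the baseline policy are assumptions of the example. With a stochastic policy, the averaging over `n_rollouts` runs would matter; here every run from a given state returns the same value.

```python
import random

def rollout_value(x, policy_run, n_rollouts, rng):
    """Average final outcome of running the baseline policy from x."""
    return sum(policy_run(x, rng) for _ in range(n_rollouts)) / n_rollouts

def rollout_step(x, neighbors, policy_run, n_rollouts=10, rng=None):
    """Pick the neighbor whose rollouts end at the lowest average cost."""
    rng = rng or random.Random(0)
    return min(neighbors(x),
               key=lambda nb: rollout_value(nb, policy_run, n_rollouts, rng))

# Toy landscape (assumption): every multiple of 7 is a local minimum.
def obj(x):
    return abs(x - 40) + 5 * (x % 7)

def hillclimb(x):
    while True:
        nb = min((x - 1, x + 1), key=obj)
        if obj(nb) >= obj(x):
            return x
        x = nb

def policy_run(x, rng):
    """Baseline policy: hillclimb from x and report the final cost."""
    return obj(hillclimb(x))

def neighbors(x):
    """Hypothetical move set with two candidate successors."""
    return [x + 8, x + 13]

greedy_choice = min(neighbors(0), key=obj)
rollout_choice = rollout_step(0, neighbors, policy_run, n_rollouts=3)
```

In this toy case the two criteria disagree: the neighbor with the better immediate objective value leads hillclimbing to a worse local optimum than the neighbor that looks worse at first.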

Unlike  STAGE,  rollout  methods  do  not  cache  the  results  of  their  lookahead  searches 
in a function approximator. In this way they avoid the biases that feature representations
and function approximation introduce, perhaps allowing more accurate moves.
However,  the  computational  cost  of  replacing  function  approximator  evaluations  by 
full  sample  runs  is  extremely  high.  This  cost  may  be  tolerable  in  a  game-playing  sce¬ 
nario  where  several  seconds  are  available  for  each  move  selection;  but  would  probably 
be  deadly  for  practical  optimization  algorithms,  where  (in  the  words  of  Buntine  [97]) 
a  general  maxim  applies:  “speed  over  smarts.” 

In  fact,  an  earlier  study  of  Abramson  [90]  demonstrated  the  power  of  rollout 
methods  for  game-playing,  although  he  was  apparently  unaware  of  the  connection  to 
Markov  decision  processes  and  policy  iteration.  In  his  “expected-outcome”  model, 
moves  in  the  game  of  Othello  were  made  by  computing,  via  multiple  rollouts  for  each 
legal move, the expected probability of winning assuming both sides play randomly
from  that  point  on.  Though  this  assumption  is  clearly  inaccurate,  Abramson  reported 
empirical  good  play  at  Othello: 

Given  no  expert  information,  the  ability  to  evaluate  only  leaves,  and  a 
good  deal  of  computation  time,  [expected-outcome  functions]  were  able 
to  play  better  than  a  competent  (albeit  non-masterful)  function  that  had 
been  handcrafted  by  an  expert.  [Abramson  90] 

This  improvement  is  consistent  with  the  experience  of  Tesauro  and  Galperin  [97],  who 
found  that  a  rollout-based  player  (implemented  on  a  parallel  IBM  supercomputer) 
dramatically  improved  the  performance  of  both  weak  and  strong  evaluation  functions. 

Abramson  [90]  went  on  to  propose  and  test  learning  the  expected-outcome  function 
with  linear  regression  over  state  features,  just  as  STAGE  does.  Reinterpreted  in  the 
language of reinforcement learning, Abramson’s method fixed a policy π (both players
play randomly), inducing a Markov chain on the game space. He collected samples
of the policy value function V^π(x) by running π multiple times from each of 5000
random starting states; trained a linear predictor from the samples; and finally, used
the predictor as a static evaluation function Ṽ^π(F(x)) to select moves. STAGE can be
seen as extending Abramson’s work from game-playing into the realm of combinatorial
optimization. A key difference is that STAGE uses a high-quality baseline policy π,
such  as  hillclimbing  or  WALKSAT,  in  place  of  Abramson’s  random  policy.  STAGE 
also  interleaves  its  training  and  decision-making  phases,  which  enables  it  to  adapt  its 
training  distribution  over  time  to  focus  on  high-quality  states. 

Abramson’s  work  is  one  of  many  examples  in  the  Artificial  Intelligence  literature  of 
learning  evaluation  functions  for  game-playing.  Samuel’s  pioneering  work  on  checkers 
[Samuel  59,  Samuel  67],  Christensen’s  work  on  chess  [Christensen  86],  and  Lee’s  work 
on  Othello  [Lee  and  Mahajan  88]  fall  into  this  category;  I  have  already  discussed 
these  in  Section  2.1.2  of  this  thesis.  More  recently,  the  successful  application  of 
reinforcement  learning  to  backgammon  [Tesauro  92,  Boyan  92]  has  inspired  similar 
investigations  in  chess  [Thrun  95,  Baxter  et  al.  97],  Go  [Schraudolph  et  al.  94],  and 
other  games. 

Moving  from  game-playing  to  problem-solving  domains,  a  study  by  Rendell  [83] 
addressed  evaluation  function  learning  in  the  sliding-tiles  puzzle  (15-puzzle).  His 
method,  when  abstracted  of  many  details,  bears  significant  similarities  to  STAGE: 
it  learns  to  approximate  a  function  which  measures  the  promise  of  each  state  as  a 
starting state for a given policy. In particular, the policy π is a best-first search (with
backtracking) guided by a given evaluation function f(x); and the measurement being
approximated is not a value function but a so-called “penetrance” measure at each
state, as defined by Doran and Michie [66]. In STAGE-like notation, the penetrance
is defined as

    P^π(x) = (length of discovered path from x to solution)
             / (total nodes expanded by π during search from x)

For example, in the 15-puzzle, given a constant evaluation function f(x) = 1, the
policy π reduces to breadth-first search, and P^π(x) is on the order of 10“^ for random
starting states x; whereas a perfect evaluation function f(x) = V*(x) would give rise
to a backtracking-free policy π with penetrance P^π(x) = 1 everywhere. Rendell fits a
linear  approximation  to  the  penetrance,  then  applies  the  fit  to  improve  search  control 
using  a  complex  bootstrapping  and  normalization  procedure. 
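
The penetrance measure can be made concrete with a toy best-first search. The sketch below is an assumption-laden illustration (an unbounded integer line instead of the 15-puzzle, and a hypothetical function name); it counts expanded nodes and divides the depth at which the goal was discovered by that count.

```python
import heapq

def best_first_penetrance(start, goal, neighbors, f):
    """Run best-first search guided by f; return path length / nodes expanded."""
    # Heap entries: (priority, insertion order, state, depth).
    # The insertion order breaks ties first-in-first-out.
    frontier = [(f(start), 0, start, 0)]
    seen = {start}
    expanded = 0
    counter = 1
    while frontier:
        _, _, x, depth = heapq.heappop(frontier)
        if x == goal:
            return depth / expanded if expanded else 1.0
        expanded += 1
        for y in neighbors(x):
            if y not in seen:
                seen.add(y)
                heapq.heappush(frontier, (f(y), counter, y, depth + 1))
                counter += 1
    return 0.0  # goal unreachable

def line_neighbors(x):
    return [x - 1, x + 1]

# A perfect evaluation function (true distance to the goal) versus a
# constant one, under which the search degenerates to breadth-first.
p_perfect = best_first_penetrance(0, 3, line_neighbors, lambda x: abs(3 - x))
p_constant = best_first_penetrance(0, 3, line_neighbors, lambda x: 1)
```

With the perfect evaluation function no node is expanded off the solution path, so the penetrance is 1; the constant function wastes expansions in the wrong direction and scores lower.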

7.4  Genetic  Algorithms 

Genetic  algorithms  (GAs) — algorithms  based  on  metaphors  of  biological  evolution 
such  as  natural  selection,  mutation,  and  recombination— represent  another  heuristic 
approach  to  combinatorial  optimization  [Goldberg  89].  Translated  into  the  terminol¬ 
ogy  of  local  search,  “natural  selection”  means  rejecting  high-cost  states  in  favor  of 
low-cost  states,  like  hillclimbing;  “mutation”  means  a  small-step  local  search  oper¬ 
ation;  and  “recombination”  means  adaptively  creating  a  new  state  from  previously 
good  solutions.  GAs  have  much  in  common  with  the  adaptive  multi-start  hillclimb¬ 
ing  approaches  discussed  above  in  Section  7.1.  In  broad  terms,  the  GA  population 
carries  out  multiple  restarts  of  hillclimbing  in  parallel,  culling  poor-performing  runs 
and  replacing  them  with  new  adaptively  constructed  starting  states. 

To  apply  GAs  to  an  optimization  problem,  the  configuration  space  X  must  be 
represented  as  a  space  of  discrete  feature  vectors — typically  fixed-length  bitstrings 
{0, 1}^ — and  the  mapping  must  be  a  bijection,  so  that  a  solution  bitstring  in  the 
feature  space  can  be  converted  back  to  a  configuration  in  X.  (This  contrasts  to 
STAGE,  where  features  can  be  any  real-valued  vector  function  of  the  state,  and  the 
mapping  need  not  be  invertible.)  Typically,  a  GA  mutation  operator  consists  of 
flipping  a  single  bit,  and  a  recombination  operator  consists  of  merging  the  bits  of  two 
“parent”  bitstrings  into  the  new  “child”  bitstring.  The  effectiveness  of  GA  search 
depends  critically  on  the  suitability  of  these  operators  to  the  particular  bitstring 
representation  chosen  for  the  problem. 
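
For concreteness, the two operators might be written as below. This is a sketch under assumptions: bitstrings are represented as Python lists of 0/1 integers, and recombination is implemented as uniform crossover, which is only one of several common ways of merging the parents' bits.

```python
import random

def mutate(bits, rng):
    """Flip a single randomly chosen bit, returning a new bitstring."""
    i = rng.randrange(len(bits))
    child = list(bits)
    child[i] ^= 1
    return child

def recombine(parent_a, parent_b, rng):
    """Uniform crossover (assumption): each bit comes from either parent."""
    return [a if rng.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]

rng = random.Random(0)
parent = [0, 1, 0, 1, 1, 0, 0, 1]
mutant = mutate(parent, rng)
child = recombine([0] * 8, [1] * 8, rng)
```

Whether such operators are useful depends on the encoding: a single bit-flip should correspond to a small, meaningful change in the underlying configuration.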

Of  course,  the  effectiveness  of  any  local  search  algorithm  depends  on  the  neigh¬ 
borhood  operators  available;  but  genetic  algorithms  generally  allow  less  flexibility  in 
designing  the  neighborhood,  since  mutations  are  represented  as  bit-flips.  Hillclimb¬ 
ing  and  simulated  annealing,  by  contrast,  allow  sophisticated,  domain-specific  search 
operators, such as the partition-graph manipulations used in simulated annealing
applications of VLSI channel routing [Wong et al. 88]. On the other hand, genetic
algorithms  have  a  built-in  mechanism  for  combining  features  of  previously  discovered 
good  solutions  into  new  starting  states.  STAGE  can  be  seen  as  providing  the  best  of 
both  worlds:  sophisticated  search  operators  and  adaptive  restarts  based  on  arbitrary 
domain  features. 

Some  GA  implementations  do  manage  to  take  advantage  of  local  search  operators 
more  sophisticated  than  bit-flips,  using  the  trick  of  embedding  a  hillclimbing  search 
into  each  objective  function  evaluation  [Hinton  and  Nowlan  87].  That  is,  the  GA’s 
population  of  bitstrings  actually  serves  as  a  population  not  of  final  solutions  but  of 
starting  states  for  hillclimbing.  The  most  successful  GA  approaches  to  the  Traveling 
Salesman  Problem  all  work  this  way  so  that  they  can  exploit  the  sophisticated  Lin- 
Kernighan  local  search  moves  [Johnson  and  McGeoch  95].  Here,  the  GA  operators 
play a role analogous to the large-step “kick moves” of Chained Local Optimization
[Martin  and  Otto  94],  as  described  in  Section  7.1  above.  Depending  on  the  particular 
implementation,  the  next  generation’s  population  may  consist  of  not  only  the  best 
starting  states  from  the  previous  generation,  but  also  the  best  final  states  found 
by hillclimbing runs, a kind of Lamarckian evolution in which learned traits are
inheritable  [Ackley  and  Littman  93,  Johnson  and  McGeoch  95]. 

In  such  a  GA,  the  population  may  be  seen  as  implicitly  maintaining  a  global 
predictive  model  of  where,  in  bitstring-space,  the  best  starting  points  are  to  be  found. 
The  COMIT  algorithm  of  Baluja  and  Davies  [97],  a  descendant  of  PHIL  [Baluja 
and  Caruana  95]  and  MIMIC  [de  Bonet  et  al.  97],  makes  this  viewpoint  explicit:  it 
generates  adaptive  starting  points  not  by  random  genetic  recombination,  but  rather 
by  first  building  an  explicit  probabilistic  model  of  the  population  and  then  sampling 
that model. COMIT’s learned probability model is similar in spirit to STAGE’s Ṽ^π
function. Differences include the following:

• COMIT is restricted to bijective bitstring-like representations, whereas STAGE
can  use  any  feature  mapping;  and 

•  COMIT’s  model  is  trained  from  only  the  set  of  best-quality  states  found  so 
far, ignoring the differences between their outcomes; whereas STAGE’s value
function  is  trained  from  all  states  seen  on  all  trajectories,  good  and  bad,  paying 
attention  to  the  outcome  values.  Boese’s  experimental  data  and  “big  valley 
structure”  hypothesis  (see  page  163)  indicate  that  there  is  often  useful  informa¬ 
tion  to  be  gained  by  modelling  the  weaker  areas  of  the  solution  space,  too  [Boese 
et  al.  94].  In  particular,  this  gives  STAGE  the  power  for  directed  extrapolation 
beyond  the  support  of  its  training  set. 

In preliminary experiments in the Boolean satisfiability domain, on the same 32-bit
parity  instances  described  in  Section  4.7,  COMIT  (using  WALKSAT  as  a  subroutine) 
did  not  perform  as  well  as  STAGE  [Davies  and  Baluja  98]. 

7.5  Discussion 

The  studies  described  in  the  last  four  sections  make  it  clear  that  STAGE  bears  close 
relationships  to  previous  work  done  in  the  optimization,  reinforcement  learning,  and 
AI  problem-solving  communities.  A  concise  summary  of  this  chapter  might  read  as 
follows: 

STAGE  takes  its  main  idea — learn  an  evaluation  function  to  improve 
search  performance — from  the  AI  literature  on  problem-solving.  It  grounds 
that  idea  in  the  theory  of  value  function  approximation.  And  it  applies 
that  idea  in  the  successful  framework  of  adaptive  multi-start  approaches 
to  global  optimization. 

STAGE  unifies  these  lines  of  research  in  an  algorithm  which  is  nonetheless  quite  simple 
to  explain  and  to  implement.  In  the  next,  concluding  chapter,  I  will  summarize  the 
novel  contributions  made  by  STAGE  and  suggest  a  number  of  future  directions  for 
integrating reinforcement learning into effective optimization algorithms.

• (§3.1.3) In the context of global optimization, I have recognized that additional
features of each state, other than the state’s objective-function value,
can  provide  useful  information  for  decision  making  in  search.  Traditional  al¬ 
gorithms  either  ignore  this  information  or  incorporate  it  in  an  ad  hoc  manner. 
STAGE  provides  a  principled,  automatic  mechanism  for  exploiting  additional 
state  features. 

• (§3.2) I have defined the predictive value function of a local search procedure,
V^π(x); described the conditions under which it corresponds to the
value  function  of  a  Markov  chain;  and  described  how  it  may  be  learned  from 
simulation  data  by  a  function  approximator. 

•  (§3.2)  I  have  introduced  STAGE,  a  straightforward  algorithm  for  exploiting 
the learned approximation of V^π to guide future search. STAGE is general: it
can  be  applied  to  any  optimization  problem  to  which  hillclimbing  applies.  It 
may  be  viewed  as  an  adaptive  multi-restart  approach  to  optimization;  the  adap¬ 
tive  component  is  automatically  learned  from  simulation  data  on  each  problem 
instance. 

•  (§3.3)  I  have  presented  two  illustrative  domains — the  1-D  wave  minimization 
example  and  the  bin-packing  example— and  demonstrated  how  STAGE  succeeds 
on  them.  These  examples  provide  clear  intuitions  of  how  learning  evaluation 
functions  can  improve  search  performance. 

•  (§3.4)  I  have  analyzed  the  theoretical  conditions  under  which  STAGE  is 
well-defined and efficient. Specifically, I have shown that V^π is well-defined
for any local search procedure π, as long as the objective function is bounded
below; however, STAGE learns to approximate V^π most efficiently if π is proper,
Markovian,  and  monotonic.  I  have  provided  several  methods  for  converting 
improper  and  nonmonotonic  procedures  into  a  form  suitable  for  STAGE. 

•  (Chapters  4  and  5)  I  have  contributed  empirical  evidence  that  STAGE  is 
applicable,  practical  and  effective  on  a  wide  variety  of  large-scale  optimization 
tasks.  On  most  tested  instances,  STAGE  outperforms  both  multi-start  hill¬ 
climbing  and  a  good  implementation  of  simulated  annealing.  On  challenging 
instances  of  the  Boolean  satisfiability  task,  STAGE  successfully  learned  from 
WALKSAT, a non-greedy local search procedure, and produced the best published
solutions to date. The empirical analyses of Chapter 5 demonstrate
STAGE’s overall robustness with various parameter settings, and also explain
under what circumstances STAGE will fail to improve optimization performance.

• (§6.1) I have introduced a least-squares formulation of the TD(λ) algorithm,
extending the work of Bradtke and Barto [96]. I empirically demonstrate
the improved data efficiency of this formulation, and give a new intuitive explanation
for the source of this efficiency: the statistics kept by LSTD(λ) amount
to a compressed model of the underlying Markov process.

•  (§6.2)  I  have  motivated,  described  and  tested  a  new  approach  to  transferring 
learned  information  from  previously  solved  optimization  problem  instances 
to  new  ones.  By  using  a  voting  mechanism,  the  X-STAGE  algorithm  avoids 
having  to  normalize  the  objective  function  across  disparate  instances. 

•  (Chapter  7)  I  have  surveyed  the  literatures  of  several  related  areas:  value 
function  approximation  (§2.1.2,  §7.2);  adaptive  multi-restart  techniques  for  local 
search  (§7.1),  including  genetic  algorithms  (§7.4);  and  simulation-based  learning 
methods  for  improving  AI  search  (§7.3).  Taken  together,  these  surveys  provide 
a  useful  collection  of  references  for  researchers  interested  in  automatic  learning 
and  tuning  of  evaluation  functions. 

8.2  Future  Directions 

The main conclusion of this thesis is that learning evaluation functions can improve
global optimization performance. STAGE is a simple, practical technique that demonstrates
this. STAGE's simplicity enables many potentially useful extensions; I suggest
some of these in Section 8.2.1. Beyond STAGE, there are at least two conceptually
different ways of utilizing value function approximation in global optimization; I discuss
these in Section 8.2.2. Finally, in Section 8.2.3, I consider the potential for
learning evaluation functions by non-VFA-based, direct meta-optimization methods.

8.2.1  Extending  STAGE 

Many modifications to and extensions of STAGE, such as varying the regression model
and training technique used to approximate $V^\pi$, have been investigated in Chapters 5
and 6 of this thesis. However, many further interesting modifications remain untried.
I describe several of these here:

Non-polynomial function approximators. My study of Section 5.2.2 was limited
to first- through fifth-order polynomial models of $V^\pi$. It would be interesting to
see whether other linear architectures (such as CMACs, radial basis function
networks, and random-representation neural networks) could produce better
fits and better performance.

A more ambitious study could investigate efficient ways to use nonlinear architectures,
such as multi-layer perceptrons or memory-based fitters, with STAGE.
In the context of transfer, the training speed of the function approximator is less
crucial. One intriguing possibility is to learn a nonlinear representation from
a set of training instances, then freeze the nonlinear components so that fast
least-squares methods can be used on the test instances. For example, STAGE
could learn a neural network representation of $V^\pi$ from training instances, then
freeze the input-to-hidden weights of the network to allow linear learning on a
new test instance. This could be a useful way to construct a feature set for a
linear architecture automatically. (Related ideas are discussed in [Utgoff 96].)

More aggressive optimization of $\tilde{V}^\pi$. On each iteration, in order to find a promising
new starting state for the baseline procedure $\pi$, STAGE optimizes $\tilde{V}^\pi$ by performing
first-improvement hillclimbing. A more aggressive optimization technique,
such as simulated annealing, could instead be applied at that stage, and
that may well improve performance.

Steepest descent. With the exception of the WALKSAT results of Section 4.7 and
the experiments of Section 5.2.3, STAGE has been trained to predict and improve
upon the baseline procedure of $\pi$ = first-improvement hillclimbing. However,
in some optimization problems (particularly those with relatively few
moves available from each state) steepest-descent (best-improvement) search
may be more effective. Steepest-descent is proper, Markovian, and monotonic,
so STAGE applies directly; and it would be interesting to compare its effectiveness
with first-improvement hillclimbing's.
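The difference between the two baseline procedures can be made concrete with a minimal sketch. The toy objective `toy_obj` and neighborhood `toy_neighbors` below are hypothetical illustrations, not domains from this thesis; both procedures only ever move to strictly better states, so each is monotonic and terminates at a local optimum.

```python
import random

def first_improvement(x, obj, neighbors, rng):
    """Hillclimb by accepting the first improving neighbor found in random order."""
    while True:
        candidates = list(neighbors(x))
        rng.shuffle(candidates)
        for y in candidates:
            if obj(y) < obj(x):
                x = y
                break
        else:
            return x  # no neighbor improves: x is a local optimum

def steepest_descent(x, obj, neighbors):
    """Hillclimb by always moving to the best neighbor, as long as it improves."""
    while True:
        best = min(neighbors(x), key=obj)
        if obj(best) >= obj(x):
            return x  # even the best neighbor is no better: local optimum
        x = best

# Hypothetical toy problem: integers 0..20 with a bumpy objective.
def toy_obj(x):
    return (x - 13) ** 2 + 6 * (x % 3)

def toy_neighbors(x):
    return [y for y in (x - 2, x - 1, x + 1, x + 2) if 0 <= y <= 20]
```

Steepest descent evaluates every neighbor before each move, which is affordable only when the neighborhood is small; first-improvement may accept a worse-than-best move but pays for fewer evaluations per step.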

Continuous optimization. This dissertation has focused on discrete global optimization
problems. However, STAGE applies without modification to continuous
global optimization problems (i.e., find $x^* = \operatorname{argmin} \mathrm{Obj}$, where $\mathrm{Obj} : \mathbb{R}^K \to \mathbb{R}$) as well.
The cartogram design problem of Section 4.6 is an example of such a problem;
however, much more sophisticated neighborhood operators than the point perturbations
I defined for that domain are available. For example, the downhill
simplex method of Nelder and Mead (described in [Press et al. 92, §10.4]) provides
an effective set of local search moves for continuous optimization. Downhill
simplex reaches a local optimum quickly, and Press et al. [92] recommend embedding
it within a multiple-restart or simulated-annealing framework. STAGE
could provide an effective learning framework for multi-restart simplex search.

Confidence intervals. STAGE identifies good restart points by optimizing $\tilde{V}^\pi(x)$,
the predicted expected outcome of search from $x$. However, in the context of a
long run involving many restarts, it may be better to start search from a state
with worse expected outcome but higher outcome variance. After all, what
we really want to minimize is not the outcome of any one trajectory, but the
minimum outcome over the whole collection of trajectories STAGE generates.
One possible heuristic along these lines would be to exploit confidence intervals
on $\tilde{V}^\pi$'s predictions to guide search. For example, STAGE could evaluate the
promise of a state $x$ by, instead of the expected value of $V^\pi(x)$, a more optimistic
measure such as

• the 25th-percentile prediction of $V^\pi(x)$, or

• the probability that $V^\pi(x)$ exceeds the best value seen so far on this run.

Such strategies could have the effect of both encouraging exploration of state-space
regions where $V^\pi$ is poorly modeled (similar to the Interval-Estimation
[Kaelbling 93] and IEMAX [Moore and Schneider 96] algorithms) and encouraging
repeated visits to states that promise to lead occasionally to excellent
solutions.
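As a rough illustration of these two measures, the sketch below assumes that the prediction for a state is summarized by a Gaussian with a given mean and standard deviation; that distributional assumption, and both function names, are hypothetical. Because the objective is minimized, "exceeds the best value" is read here as "falls below the best value seen so far".

```python
from statistics import NormalDist

def percentile_score(mean, std, q=0.25):
    """q-th percentile of a Gaussian outcome prediction (lower is more promising)."""
    if std == 0:
        return mean  # a point prediction has no spread
    return NormalDist(mean, std).inv_cdf(q)

def prob_better_than(mean, std, best_so_far):
    """Probability that the predicted outcome falls below the best value seen so far."""
    if std == 0:
        return 1.0 if mean < best_so_far else 0.0
    return NormalDist(mean, std).cdf(best_so_far)
```

Under this scoring, a state with predicted outcome 11 and standard deviation 4 is preferred to one with predicted outcome 10 and standard deviation 1, even though its expected outcome is worse.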

8.2.2  Other  Uses  of  VFA  for  Optimization 

STAGE exploits the value function for the purpose of guiding search to new
starting points for $\pi$. However, value functions can also aid optimization in at least
two further ways: filtering and sampling.

Filtering refers to the early cutoff of an unpromising search trajectory, before it
even reaches a local optimum, to conserve time for additional restarts and better
trajectories. Heuristic methods for filtering have been investigated by, e.g.,
[Nakakuki and Sadeh 94]. Perkins et al. [97] have suggested that reinforcement-learning
methods could provide a principled mechanism for deciding when to
abort a trajectory. In the context of STAGE, filtering could be implemented
simply as follows: cut off any $\pi$-trajectory when its predicted eventual outcome
$\tilde{V}^\pi(x)$ is worse than, say, the mean of all $\pi$-outcomes seen thus far. This technique
would allow STAGE to exploit its learned predictions during both stages
of search.

Sampling refers to the selection of candidate moves for evaluation during search.
In this dissertation, I have assumed that candidate moves are generated with
a probability distribution that remains stationary throughout the optimization
run. In optimization practice, however, it is often more effective to modify
the candidate distribution over the course of the search: for example, to generate
large-step candidate moves more frequently early in the search process,
and to generate small-step, fine-tuning moves more frequently later in search.
Cohn reviews techniques for adapting the sampling distribution of candidate
moves, including one “based on their probability of success and on their effect
on improving the cost function” [Cohn 92, §2.4.4]. In order to estimate these
quantities without having to invoke the (presumably expensive) objective function,
Cohn’s move generator maintains statistics for each category of move that
has been tried recently in the run, a simple kind of reinforcement learning.

A more sophisticated approach has recently been proposed by Su et al. [98].
Their method learns, over multiple simulated-annealing runs, to predict the
long-term outcome achieved by starting search at state $x$ and with initial action
$a$.¹ In reinforcement-learning terminology, their method learns to approximate
the task’s state-action value function $Q^\pi(x, a)$ [Watkins 89]. This form
of value function allows the effects of various actions $a$ to be predicted without
having to actually apply the action or invoke the objective function. Their
method uses the learned value function to preselect the most promising out of
five random candidate moves before each step of simulated annealing, thereby
saving time that would have been spent evaluating bad candidate moves. In
optimization domains where objective function evaluations are costly, the
value-function formulation offers the potential for significant speedups. It remains
for future research to determine how best to combine filtering, sampling,
and search-guiding uses of value functions in optimization.
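The preselection step described above can be sketched in a few lines. This is only an illustration of the idea, not the implementation of Su et al.: `propose` and `q_value` are hypothetical stand-ins for a random move generator and a learned state-action value function, and lower predicted outcomes are taken to be better.

```python
import random

def preselect_move(x, propose, q_value, rng, k=5):
    """Draw k random candidate moves and keep the one with the best predicted outcome.

    Only the cheap learned predictor q_value is consulted here; the (possibly
    expensive) objective function is never evaluated on the rejected candidates.
    """
    candidates = [propose(x, rng) for _ in range(k)]
    return min(candidates, key=lambda a: q_value(x, a))

# Hypothetical toy stand-ins: moves are real-valued steps, and the predictor
# guesses that outcomes are better the closer x + a lies to zero.
def toy_propose(x, rng):
    return rng.uniform(-1.0, 1.0)

def toy_q(x, a):
    return abs(x + a)
```

The selected move would then be passed to the usual simulated-annealing acceptance test, so the objective is evaluated once per step rather than once per candidate.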

¹ Note that the state vector x for simulated annealing consists of both the current configuration
X and the current temperature T.

8.2.3 Direct Meta-Optimization

All the approaches discussed in this thesis have built evaluation functions by approximating
a value function $V^\pi$ or $V^*$, functions which predict the long-term outcomes of a
search policy. However, an alternative approach not based on value function approximation,
which I call direct meta-optimization, also applies. Direct meta-optimization
methods assume a fixed parametric form for the evaluation function and optimize
those parameters directly with respect to the ultimate objective, sampled by Monte
Carlo simulation. In symbols, given an evaluation function $V(x \mid w)$ parametrized by
weights $w$, we seek to learn $w$ by directly optimizing the meta-objective function

$M(w)$ = the expected performance of search using evaluation function $V(x \mid w)$.
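To make the meta-objective concrete, the following sketch estimates $M(w)$ by Monte Carlo for a linear evaluation function. Everything in it is an illustrative assumption: the greedy search guided by $V(x \mid w)$, the toy objective, the feature map, and the neighborhood are hypothetical, and in practice the start states would be sampled at random rather than passed in as a fixed list.

```python
def greedy_search(w, obj, features, neighbors, start, max_steps=1000):
    """Greedy descent on V(x|w) = w . features(x); returns the TRUE objective reached."""
    def v(x):
        return sum(wi * fi for wi, fi in zip(w, features(x)))
    x = start
    for _ in range(max_steps):
        best = min(neighbors(x), key=v)
        if v(best) >= v(x):
            break  # local optimum of the evaluation function
        x = best
    return obj(x)

def estimate_meta_objective(w, obj, features, neighbors, starts):
    """Monte Carlo estimate of M(w): mean final objective over one search per start."""
    results = [greedy_search(w, obj, features, neighbors, s) for s in starts]
    return sum(results) / len(results)

# Hypothetical toy problem: a jagged objective on the integers 0..30.
def toy_obj(x):
    return (x - 12) ** 2 + 5 * (x % 2)

def toy_features(x):
    return (x, x * x)

def toy_neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y <= 30]
```

With `w = (-24.0, 1.0)` the evaluation function is the smooth bowl $(x - 12)^2$ up to a constant, so every search ends at the global optimum and the estimate is 0; with `w = (0.0, 0.0)` search never leaves its start state. An outer optimizer would adjust `w` using only such estimates, treating the search procedure as a black box.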

The  evaluation  functions  V  learned  by  such  methods  are  not  constrained  by  the 
Bellman  equations:  the  values  they  produce  for  any  given  state  have  no  semantic 
interpretation  in  terms  of  long-term  predictions.  The  lack  of  such  constraints  means 
that  less  information  for  training  the  function  can  be  gleaned  from  a  simulation  run. 
The  temporal-difference  goal  of  explicitly  caching  values  from  lookahead  search  into 
the  static  evaluation  function  is  discarded;  only  the  final  costs  of  completed  simulation 
runs  are  available.  However,  there  are  several  reasons  to  believe  that  a  direct  approach 
may  be  effective: 

• Not having to meet the Bellman constraints may actually make learning easier.
For example, even if a domain’s true value function is very jagged, meta-optimization
may discover a quite different, smooth $V$ that performs well. (This
point was also made in [Utgoff and Clouse 91].)

•  Direct  approaches  do  not  depend  on  the  Markov  property.  Since  they  treat 
the  baseline  search  procedure  as  a  black  box,  they  can  be  applied  to  optimize 
the  evaluation  function  for  backtracking  search  algorithms  such  as  A*,  or  in 
sequential  decision  problems  involving  hidden  state. 

•  Similarly,  extra  algorithmic  parameters  such  as  hillclimbing’s  patience  level  and 
simulated  annealing’s  temperature  schedule  can  be  included  along  with  the 
evaluation  function  coefficients  in  the  meta-optimization. 

Further  arguments  supporting  the  direct  approach  are  given  in  [Moriarty  et  al.  97]. 

Direct meta-optimization methods have been applied to learning evaluation functions
before, particularly in the game-playing literature. Genetic-algorithm approaches
to game learning generally fall into this category (e.g., [Tunstall-Pedoe 91]). Recently,
Pollack et al. attacked backgammon by hillclimbing over the 3980 weights of
a neural network [Pollack et al. 96]. Surprisingly, this procedure developed a good
backgammon player, though not on the level of Tesauro’s TD-Gammon networks.
Meta-optimization has also been applied successfully to aid combinatorial optimization.
Ochotta’s simulated annealing system for synthesizing analog circuit cells made
use of a sophisticated cost function parametrized by 46 real numbers [Ochotta 94].
These and 10 other parameters of the annealer were optimized using Powell’s method
as described in [Press et al. 92]. Each parameter setting was evaluated by summing
the mean, median and minimum final results of 200 annealing runs on a small representative
problem instance. After several months of real time and four years of CPU
time (!), Powell’s method produced an evaluation function which performed well and
generalized robustly to larger instances.

I believe that the computational requirements of direct meta-optimization can be
significantly reduced by the use of new memory-based stochastic optimization techniques
[Moore and Schneider 96, Moore et al. 98]. These techniques are designed
to optimize functions for which samples are both expensive to gather and potentially
very noisy. The meta-objective function $M$ certainly fits this characterization,
since sampling $M$ means performing a complete run of the baseline stochastic
search procedure and reporting the final result. An important future direction for
reinforcement-learning research is to carefully compare the empirical performance of
direct meta-optimization and value function approximation methods.

8.3  Concluding  Remarks 

In the decade since the deep connection between AI heuristic search and Markov decision
process theory was first identified, the field of reinforcement learning has made
much progress. Algorithms for learning sequential decision-making from simulation
data are now well understood for tasks in which the value function can be represented
exactly. This thesis has addressed the more difficult case in which the value function
must be represented compactly by a function approximator. Its primary contribution
is a practical algorithm, built on reinforcement-learning foundations, that efficiently
learns and exploits a predictive model of a search procedure’s performance. Ultimately,
I believe such methods will lead to more effective solutions to large sequential
decision problems in industry, science, and government, and thereby improve society.

A.1 The Best-So-Far Procedure Is Markovian

Here, I prove Proposition 2 of Section 3.4.1. This proposition gives us a way to apply
a natural patience-based termination criterion to a nonmonotonic search procedure
such as WALKSAT, yet still maintain the Markov property that makes the target
of STAGE's learning well-defined, by using the device of the best-so-far abstraction
BSF($\pi$), as defined in Definition 6 on page 61. The statement is as follows:

Proposition 2 (p. 62). If local search procedure $\pi$ is Markovian over a finite state
space $X$, and $\pi'$ is the procedure that results by adding patience-based termination to
$\pi$, then procedure BSF($\pi'$) is proper, Markovian, and strictly monotonic.

In  outline,  the  proof  follows  these  three  steps: 

1. The procedure $\pi'$ is proper, but not necessarily Markovian in the state space
$X$. However, $\pi'$ is proper and Markovian in an augmented state space $Z$.

2. We apply a lemma that the $Y$-abstraction (defined below) of any proper and
Markovian procedure is also proper and Markovian. BSF($\pi'$) is such an abstraction,
which proves that it is proper and Markovian in the augmented state space
$Z$.

3. Finally, we show that the Markov property still holds when trajectories of
BSF($\pi'$) are projected back down to the original state space $X$. The property
of strict monotonicity also follows trivially.

Before explaining these steps in detail, I present the definition of a $Y$-abstraction
and the lemma that Step 2 requires.

Definition 7. Let $\pi$ be a proper local search procedure defined over a state space
$Z \cup \{\text{END}\}$, and let $Y \subset Z$ be an arbitrary subset of states. Then the $Y$-abstraction of
procedure $\pi$ is a new policy $\pi'$, defined over the smaller state space $Y \cup \{\text{END}\}$, which
is derived from $\pi$ as follows: starting from any state $y_0 \in Y$, $\pi'$ generates trajectories
by following $\pi$ but filtering out all states belonging to $Z \setminus Y$ (see Figure A.1).

Figure A.1. Illustration of the $Y$-abstraction procedure. Given a trajectory
$(x_0, \ldots, x_9)$ (straight solid lines) generated by procedure $\pi$ in state space $Z$, the
$Y$-abstraction produces a corresponding shorter trajectory (dotted lines) that is restricted
to the subspace $Y$.


Lemma 1. Suppose procedure $\pi$ is proper and Markovian over state space $Z \cup \{\text{END}\}$,
and that procedure $\pi'$ is the $Y$-abstraction of $\pi$ for any given subset $Y \subset Z$. Then $\pi'$
is proper and Markovian over the subspace $Y' = Y \cup \{\text{END}\}$.

Proof of Lemma 1. First, note that if we apply $\pi$ starting from any state $x_0 \in Z$,
we get a trajectory $\tau$ which must eventually visit a state $y \in Y'$, since $\pi$ is assumed
proper and $Y'$ includes the terminal state. Let ENTER($\tau$, $Y'$) be the first state in $Y'$
that $\tau$ visits after leaving $x_0$. Then, over the full distribution of trajectories $\mathcal{T}$ that $\pi$
can generate from $x_0$, the probability that $y$ is the first state subsequently visited in
$Y'$ is given by $p(y \mid x_0)$, computed as follows:

$$\forall x_0 \in Z,\ y \in Y': \quad p(y \mid x_0) = \sum_{\{\tau \in \mathcal{T} \,:\, \mathrm{ENTER}(\tau, Y') = y\}} P(\tau \mid x_0^\tau = x_0) \tag{A.1}$$

Now consider a trajectory of the $Y$-abstraction policy $\pi'$, starting at a state $y_0 \in Y$.
By definition of $\pi'$, this trajectory will be built by applying the original policy $\pi$ and
filtering out the states in $Z \setminus Y$. Thus, the first transition will be to some state $y_1 \in Y'$
with probability given precisely by $p(y_1 \mid y_0)$ as defined in Equation A.1. Furthermore,
by the Markov assumption on $\pi$, the future transitions from $y_1$ will be independent
of the trajectory history and obey the same probabilities $p(y_{i+1} \mid y_i)\ \forall i > 0$. Thus, $\pi'$
is Markovian over $Y'$. $\pi'$ is also clearly proper, by inheritance from $\pi$. □

I  now  fill  in  the  details  of  the  proof  of  Proposition  2,  following  the  three-part 
outline  sketched  above. 

Proof  of  Proposition  2. 

1. In the statement of the proposition, $\pi'$ is defined to be the procedure that results
when patience-based termination is imposed on a procedure $\pi$ that is Markovian,
but not necessarily proper. Recall that patience-based termination means, for
a given patience level $\mathrm{Pat} \ge 1$, that any trajectory will end deterministically
as soon as Pat consecutive steps are taken with no improvement over the best
state found so far. That is, for all trajectories $\tau$ produced by $\pi'$:

$$(i \ge \mathrm{Pat}) \wedge \Bigl(\mathrm{Obj}(x_{i-\mathrm{Pat}}^\tau) = \min_{0 \le j \le i} \mathrm{Obj}(x_j^\tau)\Bigr) \;\Rightarrow\; x_{i+1}^\tau = \text{END}.$$

Since the state space $X$ is assumed finite, $\pi'$ is certainly proper: the Obj values
cannot keep decreasing forever. However, $\pi'$ is not necessarily Markovian.
Because of the new termination condition, the transition probability

$$P(x_{i+1}^\tau = x_{i+1} \mid x_0^\tau = x_0,\ x_1^\tau = x_1,\ \ldots,\ x_i^\tau = x_i)$$

is not independent of the history $x_0, \ldots, x_{i-1}$ as required by Definition 3 (page 57).
In particular, the probability of termination depends not just on the current
state, but also on how recently the best-so-far state was found, and what the
Obj value at that state was.

However, the procedure $\pi'$ can be made Markovian by augmenting the state
space with these relevant extra variables. Define the augmented state space
$Z = X \times \mathbb{R} \times \mathbb{N}$, where the three components of a state $z \in Z$ are given by

$a(z) \in X$, representing a state $x$ in the original state space

$b(z) \in \mathbb{R}$, representing the best Obj value seen on the current trajectory

$c(z) \in \mathbb{N}$, representing the count of non-improving steps since finding $b(z)$.
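The role of the augmented state can be illustrated with a minimal sketch in which the next state is computed from the triple $(a, b, c)$ alone, with no access to the earlier history. The function names and the toy random walk are hypothetical; `propose` stands in for one step of the underlying Markovian procedure, and a non-improving step is taken to be one whose Obj value is not strictly below $b$.

```python
import random

END = "END"

def step_augmented(z, propose, obj, patience, rng):
    """One transition of the patience-terminated procedure in the augmented space.

    z = (a, b, c): current state, best Obj value so far, non-improving step count.
    The result depends only on z, never on how z was reached.
    """
    a, b, c = z
    if c == patience:
        return END                     # patience exhausted: terminate
    a_next = propose(a, rng)           # one step of the underlying procedure
    value = obj(a_next)
    if value < b:
        return (a_next, value, 0)      # new best-so-far state: reset the counter
    return (a_next, b, c + 1)          # no improvement: increment the counter

def run(x0, propose, obj, patience, rng):
    """Generate the full trajectory of augmented states starting from x0."""
    z = (x0, obj(x0), 0)
    trajectory = [z]
    while True:
        z = step_augmented(z, propose, obj, patience, rng)
        if z == END:
            return trajectory
        trajectory.append(z)

def best_so_far(trajectory):
    """Keep only the states at which a new best value was recorded (counter == 0)."""
    return [a for (a, b, c) in trajectory if c == 0]

# Hypothetical toy procedure: an unbiased random walk on the integers, Obj = |x|.
def toy_walk(a, rng):
    return a + rng.choice((-1, 1))
```

Filtering a trajectory with `best_so_far` yields a sequence of states whose Obj values strictly decrease, which is the kind of trajectory the best-so-far abstraction is meant to produce.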

In this augmented space, a trajectory of procedure $\pi'$ begins at the state $z_0 =
(x_0, \mathrm{Obj}(x_0), 0)$. All future state transitions depend only on the current state
$z_i$. There are three kinds of possible transitions: (1) the patience counter may
expire, causing deterministic termination; (2) a new best-so-far state may be
discovered, in which case $b(z)$ is updated and $c(z)$ is reset to zero; or (3) an
ordinary transition to a non-best-so-far state occurs, in which case $c(z)$ is simply
incremented. These give rise to the following Markovian transition probabilities:
$\forall z \in Z,\ z' \in Z \cup \{\text{END}\}$,

$$P(z_{i+1} = z' \mid z_i = z) = \begin{cases}
1 & \text{if } (c(z) = \mathrm{Pat}) \wedge (z' = \text{END}) \\
p(a(z') \mid a(z)) & \text{if } (\mathrm{Obj}(a(z')) < b(z)) \wedge (b(z') = \mathrm{Obj}(a(z'))) \wedge (c(z') = 0) \\
p(a(z') \mid a(z)) & \text{if } (\mathrm{Obj}(a(z')) \ge b(z)) \wedge (b(z') = b(z)) \wedge (c(z') = c(z) + 1) \\
0 & \text{otherwise.}
\end{cases}$$

Thus, $\pi'$ is proper and Markovian over the space $Z \cup \{\text{END}\}$, concluding the
first segment of our proof.

2. We now apply Lemma 1 to show that the best-so-far abstraction of $\pi'$, denoted
BSF($\pi'$), is also proper and Markovian in the augmented state space $Z$. Let the
subset $Y$ be defined by

$$Y = \{z \in Z : (b(z) = \mathrm{Obj}(a(z))) \wedge (c(z) = 0)\}.$$

$Y$ consists of precisely the best-so-far states on any trajectory of $\pi'$. By Lemma 1,
BSF($\pi'$) is proper and Markovian over the subspace $Y \cup \{\text{END}\}$. That is, following
procedure BSF($\pi'$), the transitions from any state $y$ are given by fixed
probabilities $p(y' \mid y)$ independent of any past history of the procedure.

3. Finally, we note that every state $y \in Y$ has the form $(x, \mathrm{Obj}(x), 0)$. All the
information about the state $y$ is determined by the first component, the underlying
state in the un-augmented space. It follows that the procedure BSF($\pi'$)
is Markovian in the underlying space $X \cup \{\text{END}\}$, with transition probabilities
given simply by

$$P(x_{i+1} = x' \mid x_i = x) = p\bigl((x', \mathrm{Obj}(x'), 0) \mid (x, \mathrm{Obj}(x), 0)\bigr).$$

We note that the property of strict monotonicity follows trivially from the definition
of BSF, completing the proof of Proposition 2. □

A.2 Least-Squares TD(1) Is Equivalent to Linear Regression

This section demonstrates that the incremental algorithm LSTD(1), which is a special
case of the LSTD(λ) procedure introduced in Section 6.1.2, produces an approximate
value function which is equivalent to that which would be generated by standard,
non-incremental, least-squares linear regression.

To be precise, assume we are given a sample trajectory $(x_0, x_1, \ldots, x_L, \text{END})$ of a
Markov chain, with one-step rewards of $R(x_i, x_{i+1})$ on each step. From this trajectory,
a supervised learning system would generate the training pairs:

$$\begin{aligned}
\phi_0 &\mapsto R(x_0, x_1) + R(x_1, x_2) + \cdots + R(x_L, \text{END}) \\
\phi_1 &\mapsto R(x_1, x_2) + \cdots + R(x_L, \text{END}) \\
&\ \ \vdots \\
\phi_L &\mapsto R(x_L, \text{END})
\end{aligned}$$

where $\phi_i$ is the vector of features representing state $x_i$. Performing standard least-squares
linear regression on the above training set, as described for STAGE in Chapter 3
(Equation 3.8, page 67), produces the following regression matrices:

$$A_{LR} = \sum_{i=0}^{L} \phi_i \phi_i^T \qquad b_{LR} = \sum_{i=0}^{L} \phi_i y_i \qquad \text{where } y_i = \sum_{j=i}^{L} R(x_j, x_{j+1}).$$

I now show that, thanks to the algebraic trick of the eligibility vectors $z_t$, LSTD(1)
builds the equivalent $A$ and $b$ fully incrementally, without having to store the trajectory
while waiting to observe the eventual outcome $y_i$. Please refer to Table 6.1.1
(p. 140) for the definition of the algorithm.
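The claimed equivalence can also be checked numerically on a small example. The sketch below is not the thesis implementation: it assumes the LSTD(1) updates $z_i = z_{i-1} + \phi_i$, $A \leftarrow A + z_i(\phi_i - \phi_{i+1})^T$ and $b \leftarrow b + z_i R(x_i, x_{i+1})$, with zero features for the END state, and the function names are hypothetical.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add_into(m, delta):
    """In-place matrix addition m += delta."""
    for row, drow in zip(m, delta):
        for j, d in enumerate(drow):
            row[j] += d

def regression_stats(phis, rewards):
    """Batch statistics: A = sum phi_i phi_i^T, b = sum phi_i y_i (y_i = reward-to-go)."""
    k = len(phis[0])
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for i, phi in enumerate(phis):
        y = sum(rewards[i:])  # needs the whole remaining trajectory
        add_into(A, outer(phi, phi))
        for j in range(k):
            b[j] += phi[j] * y
    return A, b

def lstd1_stats(phis, rewards):
    """Incremental statistics built with an eligibility vector, one step at a time."""
    k = len(phis[0])
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    z = [0.0] * k
    for i, phi in enumerate(phis):
        z = [zj + pj for zj, pj in zip(z, phi)]              # eligibility: sum of features so far
        phi_next = phis[i + 1] if i + 1 < len(phis) else [0.0] * k  # END has zero features
        diff = [p - q for p, q in zip(phi, phi_next)]
        add_into(A, outer(z, diff))
        for j in range(k):
            b[j] += z[j] * rewards[i]
    return A, b
```

Here `rewards[i]` plays the role of $R(x_i, x_{i+1})$. For `phis = [[1, 2], [0, 1], [3, 1]]` and `rewards = [1, 2, 4]`, both functions return the same matrix and vector, although only the batch version has to look ahead along the trajectory.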

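Proof. With simple algebraic manipulations, the sums built by LSTD(1)’s $A$ and $b$
telescope neatly into $A_{LR}$ and $b_{LR}$, as follows:

$$\begin{aligned}
A &= \sum_{i=0}^{L} z_i (\phi_i - \phi_{i+1})^T \\
&= \sum_{i=0}^{L} \sum_{j=0}^{i} \phi_j (\phi_i - \phi_{i+1})^T && \text{(by definition of } z_t\text{)} \\
&= \sum_{i=0}^{L} \sum_{j=0}^{i} \phi_j \phi_i^T - \sum_{i=0}^{L} \sum_{j=0}^{i} \phi_j \phi_{i+1}^T \\
&= \Bigl(\phi_0 \phi_0^T + \sum_{i=1}^{L} \sum_{j=0}^{i} \phi_j \phi_i^T\Bigr) - \sum_{k=1}^{L+1} \sum_{j=0}^{k-1} \phi_j \phi_k^T && \text{(substituting } k = i + 1\text{)} \\
&= \phi_0 \phi_0^T + \sum_{i=1}^{L} \sum_{j=0}^{i} \phi_j \phi_i^T - \sum_{k=1}^{L} \sum_{j=0}^{k-1} \phi_j \phi_k^T && \text{(since } \phi_{L+1} = 0\text{)} \\
&= \phi_0 \phi_0^T + \sum_{i=1}^{L} \Bigl(\sum_{j=0}^{i} \phi_j \phi_i^T - \sum_{j=0}^{i-1} \phi_j \phi_i^T\Bigr) && \text{(substituting } i = k\text{)} \\
&= \sum_{i=0}^{L} \phi_i \phi_i^T \\
&= A_{LR}, \text{ as desired;}
\end{aligned}$$

and

$$\begin{aligned}
b &= \sum_{i=0}^{L} z_i R(x_i, x_{i+1}) \\
&= \sum_{i=0}^{L} \sum_{j=0}^{i} \phi_j R(x_i, x_{i+1}) && \text{(by definition of } z_t\text{)} \\
&= \sum_{i=0}^{L} \sum_{j=0}^{L} \mathbf{1}(j \le i)\, \phi_j R(x_i, x_{i+1}) && \text{(where } \mathbf{1}(\mathit{True}) = 1,\ \mathbf{1}(\mathit{False}) = 0\text{)} \\
&= \sum_{j=0}^{L} \sum_{i=0}^{L} \mathbf{1}(j \le i)\, \phi_j R(x_i, x_{i+1}) \\
&= \sum_{j=0}^{L} \phi_j \sum_{i=j}^{L} R(x_i, x_{i+1}) \\
&= b_{LR}, \text{ as desired.}
\end{aligned}$$

These reductions prove that the contributions to $A$ and $b$ by any single trajectory
are identical in LSTD(1) and least-squares linear regression. In both algorithms,
contributions from multiple trajectories are simply summed into the matrices. Thus,
LSTD(1) and linear regression compute the same statistics and, ultimately, the same
coefficients for the approximated value function. □

B.1 Annealing Schedules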

Simulated annealing (SA) is described by the template in Table B.1. The main implementation
challenge is to choose an effective annealing schedule for the temperature
parameter. The temperature controls the probability of accepting a step to a worse
neighbor: when $t_i = +\infty$, SA accepts any step, acting as a random walk; whereas
when $t_i = 0$, SA rejects all worsening steps, acting as stochastic hillclimbing with
equi-cost moves.


Simulated-Annealing(X, S, N, Obj, TotEvals, Schedule):

Given:

• a state space $X$

• starting states $S \subset X$

• a neighborhood structure $N : X \to 2^X$

• an objective function, $\mathrm{Obj} : X \to \mathbb{R}$, to be minimized

• TotEvals, the number of state evaluations allotted for this run

• an annealing Schedule which determines the temperature on each iteration

1. Let $x_0 \in S$ be a random starting state for search;
let $t_0$ := the initial temperature of Schedule

2. For $i$ := 0 to TotEvals − 1, do:

(a) Choose $x'$ := a random element from $N(x_i)$

(b) $x_{i+1} := \begin{cases} x' & \text{if } \mathrm{rand}[0,1) < e^{[\mathrm{Obj}(x_i) - \mathrm{Obj}(x')]/t_i} \\ x_i & \text{otherwise} \end{cases}$

(c) Update temperature $t_{i+1}$ according to Schedule

3. Return the best state found.


Table B.1. A template for the simulated annealing algorithm
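A direct transcription of this template into code might look as follows. This is a hedged sketch rather than the implementation used in the experiments: it assumes that `schedule` is a function mapping the iteration index to a temperature, that `neighbors(x)` returns a non-empty list, and that improving or equi-cost moves are accepted outright (which is what the acceptance test implies, since the exponential is then at least 1).

```python
import math
import random

def simulated_annealing(starts, neighbors, obj, tot_evals, schedule, rng):
    """Simulated annealing following the template: propose, accept or reject, cool."""
    x = rng.choice(starts)
    best = x
    for i in range(tot_evals):
        t = schedule(i)
        x_new = rng.choice(neighbors(x))
        delta = obj(x) - obj(x_new)  # positive when the proposed move improves
        # Accept if rand[0,1) < exp(delta / t). Non-worsening moves always pass;
        # at t == 0 every worsening move is rejected.
        if delta >= 0 or (t > 0 and rng.random() < math.exp(delta / t)):
            x = x_new
        if obj(x) < obj(best):
            best = x
    return best
```

For example, `simulated_annealing([0], lambda x: [y for y in (x - 1, x + 1) if 0 <= y <= 20], lambda x: (x - 13) ** 2, 500, lambda i: 0.0, random.Random(0))` runs the zero-temperature special case, which behaves as stochastic hillclimbing.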
The temperature annealing schedule has been the subject of extensive theoretical
and experimental analysis in the simulated annealing literature [Boese 96]. The
possibilities include the following:

• Logarithmic: $t_i = \gamma / \log(i + 2)$. For sufficiently high $\gamma$, a schedule of this form
guarantees that SA will converge with probability one to the globally optimal
solution [Mitra et al. 86, Hajek 88]. Unfortunately, this guarantee only applies
in the limit as TotEvals → ∞. Logarithmic schedules are generally too slow
for practical use.

• Geometric: for a fixed initial temperature $t_0$, cooling rate $\alpha \in (0, 1)$ and round
length $L \ge 1$, define $t_i = t_0 \cdot \alpha^{\lfloor i/L \rfloor}$. This is the original schedule proposed by
Kirkpatrick [83], and is still widely used in practice [Johnson et al. 89, Johnson
et al. 91, Press et al. 92]. However, the parameters $t_0$, $L$, and $\alpha$ must be tuned
for each problem instance.

• Adaptive: a variety of adaptive schedules have been proposed, with both theoretical
and practical motivations. [Boese 96] summarizes the most notable of
these, including the schedules of [Aarts and van Laarhoven 85], [Huang et al. 86],
and [Lam and Delosme 88]. According to [Ochotta 94],

Results for the Lam schedule are quite impressive. When compared
with other general-purpose annealing schedules such as [Huang et
al. 86] and even hand-crafted schedules like the one in [Sechen and
Sangiovanni-Vincentelli 86], it often provides speed-ups of 50% while
actually improving the quality of the final answers [Lam and Delosme
88]. Simulated annealing with the Lam schedule even compares favorably
with heuristic combinatorial optimization methods tuned to
specific problems like partitioning and the travelling salesman problem
[Lam 88].

B.2  The  “Modified  Lam”  Schedule 

For  the  purposes  of  comparing  against  STAGE,  I  sought  to  implement  simulated  an¬ 
nealing  with  a  cooling  schedule  which  would  both  perform  very  well  and  would  require 
little  tuning  from  one  problem  instance  to  the  next.  After  some  experimentation,  I 
settled  on  an  adaptive  schedule  similar  to  that  used  in  [Ochotta  94],  which  in  turn  was 
based  on  Swartz’s  modification  of  the  Lam  schedule  [Swartz  and  Sechen  90].  Swartz 
observed  that  for  each  of  a  large  number  of  annealing  runs  using  the  Lam  schedule, 
the  accept  ratio — that  is,  the  ratio  of  moves  accepted  to  moves  considered — ^followed 
an almost identical pattern with respect to the move counter i [Ochotta 94, p. 137]:

The accept rate starts at about 100%, decreases exponentially until it stabilizes
(about 15% of the way through the run) at about 44%, and remains
there  until  about  65%  of  the  way  through  the  run.  It  then  continues  its 
exponential  decline  until  the  end  of  the  annealing  process.  Based  on  this 
observation,  Swartz  duplicated  the  effect  of  the  Lam  schedule  by  using  a 
simple feedback loop to control the temperature and force the accept ratio
to follow the curve in Figure [B.1]. In comparing the modified schedule to
the  original  schedule,  Swartz  reported  almost  no  difference  in  the  quality 
of  the  final  answers  [Swartz  and  Sechen  90].  One  additional  benefit  of  the 
modified  Lam  schedule  is  that,  in  contrast  to  the  original  schedule,  the 
total  number  of  moves  in  the  annealing  run  can  be  specified  by  the  user. 


Figure B.1. Accept rate targets for the modified Lam schedule


All  details  of  my  implementation  of  Swartz’s  modified  Lam  schedule  are  given 
below in Table B.2. I do not dynamically readjust the neighborhood structure N(x)
over the course of search, as the original Lam schedule did, since such readjustments
cannot be specified in a problem-independent manner.


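Simulated-Annealing(X, S, N, Obj, TotEvals, Modified-Lam-Schedule):

1. Let $x_0 \in S$ be a random starting state for search;
   let AcceptRate := 0.5, $t_0$ := 0.5.

2. For i := 0 to TotEvals − 1, do:

   (a) Choose x' := a random element from $N(x_i)$

   (b) $x_{i+1} := \begin{cases} x' & \text{if } \mathrm{RAND}[0,1) < e^{[\mathrm{Obj}(x_i) - \mathrm{Obj}(x')]/t_i} \\ x_i & \text{otherwise} \end{cases}$

   (c) Update temperature as follows:

       i. let AcceptRate := $\begin{cases} \frac{1}{500}(499 \cdot \mathrm{AcceptRate} + 1) & \text{if } x' \text{ was accepted} \\ \frac{1}{500}(499 \cdot \mathrm{AcceptRate}) & \text{if } x' \text{ was rejected} \end{cases}$

       ii. let d := i/TotEvals

       iii. let TargRate := $\begin{cases} 0.44 + 0.56 \cdot 560^{-d/0.15} & \text{if } 0 \le d < 0.15 \\ 0.44 & \text{if } 0.15 \le d < 0.65 \\ 0.44 \cdot 440^{-(d-0.65)/0.35} & \text{if } 0.65 \le d < 1 \end{cases}$

       iv. let $t_{i+1}$ := $\begin{cases} t_i \cdot 0.999 & \text{if AcceptRate} > \text{TargRate} \\ t_i / 0.999 & \text{otherwise} \end{cases}$

3. Return the best state found.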


Table B.2. Details of the “modified Lam” adaptive simulated annealing schedule,
instantiating the template of Figure B.1. The TargRate function is plotted in
Figure B.1.
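
The procedure of Table B.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function names, the `neighbor(x, rng)` and `obj(x)` callables, and the temperature floor `1e-300` (a numerical guard against underflow) are hypothetical additions. The base 560 in the first branch of the target rate follows the reconstruction of Table B.2 given above.

```python
import math
import random


def target_rate(d):
    """Target accept rate after a fraction d (0 <= d < 1) of the run."""
    if d < 0.15:
        return 0.44 + 0.56 * 560.0 ** (-d / 0.15)
    if d < 0.65:
        return 0.44
    return 0.44 * 440.0 ** (-(d - 0.65) / 0.35)


def simulated_annealing(start, neighbor, obj, tot_evals, rng=None):
    """Minimize obj with the modified Lam schedule; returns (best, obj(best))."""
    rng = rng or random.Random()
    x, fx = start, obj(start)
    best, fbest = x, fx
    accept_rate, temp = 0.5, 0.5
    for i in range(tot_evals):
        cand = neighbor(x, rng)
        fc = obj(cand)
        # Metropolis test. Improving moves always pass, which matches
        # RAND[0,1) < e^{(Obj(x) - Obj(x'))/t} and avoids overflow in exp.
        accepted = fc <= fx or rng.random() < math.exp((fx - fc) / temp)
        if accepted:
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
        # Running estimate of the accept rate (weight 1/500 on the newest move).
        accept_rate = (499.0 * accept_rate + (1.0 if accepted else 0.0)) / 500.0
        # Feedback loop: cool when accepting too often, heat otherwise.
        if accept_rate > target_rate(i / tot_evals):
            temp = max(temp * 0.999, 1e-300)  # floor is an added guard
        else:
            temp /= 0.999
    return best, fbest
```

For example, `simulated_annealing(50, lambda x, r: x + r.choice((-1, 1)), lambda x: x * x, 5000)` minimizes a one-dimensional quadratic over the integers.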


B.3 Experiments

I  certainly  do  not  claim  that  the  modified  Lam  schedule  is  perfectly  optimized  for 
every problem attempted in Chapter 4. Getting the best performance from simulated
annealing on any given problem is an art that involves refining the temperature
schedule, dynamically adjusting the search neighborhood, and tuning the cost function
coefficients. However, empirically, the simulated annealing algorithm defined by
Table  B.2  does  seem  to  perform  very  well  on  a  wide  variety  of  problems  without 
requiring  further  tuning. 

The  following  experiments  illustrate  the  effectiveness  of  my  implementation.  I 
consider  four  of  the  optimization  domains  of  Chapter  4:  bin-packing  (§4.2),  VLSI 
channel  routing  (§4.3),  Bayes  net  learning  (§4.4),  and  cartogram  design  (§4.6).  For 
each  of  these  domains,  I  compare  the  performance  of  the  modified  Lam  schedule 
with that of 12 different geometric cooling schedules, defined by $t_i = t_0 \cdot \alpha^i$ for all
combinations of

initial temperature $t_0 \in \{10, 1, 0.1, 0.01\}$

cooling rate $\alpha \in \{0.9999, 0.99999, 0.999999\}$.

The  total  length  of  the  schedules  is  set  to  TotEvals,  the  same  setting  used  in  the 
comparative  experiments  of  Chapter  4:  10^  for  bin-packing  and  Bayes  net  learning, 
5  •  10®  for  channel  routing,  and  10®  for  cartogram  design. 
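
The grid of geometric schedules described above can be enumerated as follows. This is only an illustrative sketch; the names `INITIAL_TEMPS`, `COOLING_RATES`, and `geometric_temperature` are hypothetical.

```python
from itertools import product

# The four initial temperatures and three cooling rates listed above.
INITIAL_TEMPS = (10.0, 1.0, 0.1, 0.01)
COOLING_RATES = (0.9999, 0.99999, 0.999999)


def geometric_temperature(t0, alpha, i):
    """Temperature at move i under the geometric schedule t_i = t0 * alpha**i."""
    return t0 * alpha ** i


# All 4 x 3 = 12 (t0, alpha) combinations compared against the modified Lam schedule.
schedules = list(product(INITIAL_TEMPS, COOLING_RATES))
```

Each pair in `schedules` defines one of the 12 geometric boxes in the comparison figures.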

Results are shown in Figures B.2-B.5. Each figure plots thirteen boxes: the leftmost
box corresponds to the modified Lam schedule results (copied from Chapter 4),
and the other 12 boxes correspond to the performance of the various geometric cooling
schedules (averaged over 10 runs each). In all cases, the modified Lam schedule is
nearly as effective as, if not more effective than, the best of the geometric schedules.


Figure B.2. Simulated annealing schedules for bin-packing instance u250_13: Modified
Lam schedule (leftmost) versus 12 geometric cooling schedules


Figure B.3. Simulated annealing schedules for channel routing instance YK4: Modified
Lam schedule (leftmost) versus 12 geometric cooling schedules



Figure B.4. Simulated annealing schedules for Bayes net structure-finding instance
SYNTH125K: Modified Lam schedule (leftmost) versus 12 geometric cooling schedules




Figure B.5. Simulated annealing schedules for cartogram design instance US49:
Modified Lam schedule (leftmost) versus 12 geometric cooling schedules

Appendix C

Implementation  Details  of  Problem  Instances 


This  appendix  gives  implementation  details  for  the  optimization  problems  used  in  the 
experiments  of  Chapter  4.  For  further  information,  including  the  complete  datasets 
used  in  the  bin-packing,  Bayes  net  structure-finding,  and  cartogram  domains,  please 
access  the  following  web  page: 

http://www.cs.cmu.edu/~AUTON/stage/ (C.1)

C.1 Bin-packing

The bin-packing problem was introduced in Sections 3.3.2 and 4.2. We are given a bin
capacity $C$ and a list $L = (a_1, a_2, \ldots, a_n)$ of items, each having a size $s(a_i) > 0$. The goal
is to pack the items into as few bins as possible, i.e., partition them into a minimum
number $m$ of subsets $B_1, B_2, \ldots, B_m$ such that for each $B_j$, $\sum_{a_i \in B_j} s(a_i) \le C$.

In  the  STAGE  experiments,  value  function  approximation  was  done  with  respect 
to  two  state  features:  the  objective  function  itself,  and  the  variance  in  bin  fullness 
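levels. In terms of the above notation, given a packing $x = \{B_1, B_2, \ldots, B_{M_x}\}$ (with
all bins $B_j$ assumed non-empty), we have

$$\mathrm{Obj}(x) = M_x,$$

$$\mathrm{Var}(x) = \Bigl(\frac{1}{M_x}\sum_{j=1}^{M_x} \mathrm{fullness}(B_j)^2\Bigr) - \Bigl(\frac{1}{M_x}\sum_{j=1}^{M_x} \mathrm{fullness}(B_j)\Bigr)^2,$$

where

$$\mathrm{fullness}(B_j) = \sum_{a_i \in B_j} s(a_i).$$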
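
As a small illustration, the two state features can be computed from a packing represented as a list of bins, each bin being a list of item sizes. This representation and the function names are assumptions made for the sketch, and fullness is taken to be the unnormalized sum of item sizes.

```python
def fullness(bin_items):
    """Total size of the items placed in one bin."""
    return sum(bin_items)


def bin_packing_features(packing):
    """Return (Obj, Var): the number of bins and the variance of bin fullness.

    `packing` is a list of non-empty bins, each a list of item sizes.
    """
    m = len(packing)
    levels = [fullness(b) for b in packing]
    mean = sum(levels) / m
    mean_sq = sum(f * f for f in levels) / m
    # Variance as E[f^2] - (E[f])^2 over the bins.
    return m, mean_sq - mean * mean
```

For instance, `bin_packing_features([[50], [100]])` describes a two-bin packing whose fullness levels differ, while a packing whose bins are all equally full has variance zero.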

The  illustrative  example  of  Section  3.3.2  consisted  of  the  following  30  items,  to  be 
packed  into  bins  of  capacity  100: 

(27,  23,  23,  23,  27,  26,  26,  51,  26,  23, 

51,  23,  51,  23,  23,  51,  23,  23,  27,  23, 

51, 27, 51, 26, 27, 23, 26, 27, 26, 23)

These particular item sizes were motivated by Figure 2.5 of [Coffman et al. 96], which
depicts a template of a worst-case example for the “First Fit Decreasing” offline
bin-packing heuristic. The optimal packing fills 9 bins exactly to capacity.
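
To make the worst-case behavior concrete, the following sketch runs a textbook First Fit Decreasing pass over the 30 items listed above. The function name and list-of-lists representation are assumptions for illustration. On this instance the heuristic opens 11 bins, compared with the 9 bins of the optimal packing (six bins of 51 + 26 + 23 and three bins of 27 + 27 + 23 + 23).

```python
ITEMS = [27, 23, 23, 23, 27, 26, 26, 51, 26, 23,
         51, 23, 51, 23, 23, 51, 23, 23, 27, 23,
         51, 27, 51, 26, 27, 23, 26, 27, 26, 23]
CAPACITY = 100


def first_fit_decreasing(items, capacity):
    """Place items, largest first, into the first bin with enough room."""
    bins = []
    for size in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            # No existing bin can hold this item, so open a new one.
            bins.append([size])
    return bins


ffd_packing = first_fit_decreasing(ITEMS, CAPACITY)
```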

The twenty instances of the u250 class were contributed to the Operations Research
Library by Falkenauer [96]. They may be downloaded from the web page
referenced  at  the  start  of  this  section,  or  directly  from  the  OR-Library  web  site  at 

http://www.ms.ic.ac.uk/info.html. 


C.2  VLSI  Channel  Routing 

My  implementation  of  channel  routing  follows  that  of  the  SACR  system  [Wong  et 
al.  88].  This  system  allows  a  restricted  form  of  doglegging,  whereby  a  net  may  be 
split  horizontally  only  at  columns  containing  a  pin  belonging  to  that  net. 

Most  of  the  experiments  in  this  dissertation  were  conducted  on  the  instance  YK4. 
That  instance  is  specified  by  the  following  pin  columns: 

Upper:  17  9  23  33  0  17  34  33  32  31  32  20  9  10  21  34  0  31  22  10  0  22  1  3  16  0  0  0  9 
19  7  0  16  14  7  51  43  57  67  0  51  68  67  66  65  66  54  43  44  55  68  0  65  56  44  0  56 
35  37  50  0  0  0  43  53  41  0  50  48  41  85  77  91  101  0  85  102  101  100  99  100  88  77 
78  89  102  0  99  90  78  0  90  69  71  84  0  0  0  77  87  75  0  84  82  75  119  111  125  135  0 
119  136  135  134  133  134  122  111  112  123  136  0  133  124  112  0  124  103  105  118 
0  0  0  111  121  109  0  118  116  109 

Lower: 0 0 0 24 10 0 4 2 21 2 4 23 1 4 24 1 4 2 0 1 4 0 3 19 2 3 20 2 3 14 3 1 9 24 0
0  0  0  58  44  0  38  36  55  36  38  57  35  38  58  35  38  36  0  35  38  0  37  53  36  37  54  36 
37  48  37  35  43  58  0  0  0  0  92  78  0  72  70  89  70  72  91  69  72  92  69  72  70  0  69  72  0 
71  87  70  71  88  70  71  82  71  69  77  92  0  0  0  0  126  112  0  106  104  123  104  106  125 
103  106  126  103  106  104  0  103  106  0  105  121  104  105  122  104  105  116  105  103 
111  126  0 

These  pin  columns  correspond  to  Example  1  (Figure  25)  of  [Yoshimura  and  Kuh 
82],  but  multiplied  four  times  and  placed  side  by  side.  By  “cloning”  the  problem  in 
this  manner,  we  maintain  the  known  global  optimum  (in  this  case,  10  tracks  when 
restricted  doglegging  is  allowed),  but  make  the  problem  much  more  difficult  for  local 
search. 
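
The cloning construction can be sketched as follows, assuming that 0 denotes an empty pin position and that each copy's net numbers are shifted by the number of nets in the base problem (34 here, as the listed columns suggest). The function name `clone_pins` is hypothetical, and only the upper row is shown; the lower row would be handled the same way.

```python
# First 35 upper pin columns of YK4, i.e. one copy of the base example.
BASE_UPPER = [17, 9, 23, 33, 0, 17, 34, 33, 32, 31, 32, 20, 9, 10, 21, 34, 0,
              31, 22, 10, 0, 22, 1, 3, 16, 0, 0, 0, 9, 19, 7, 0, 16, 14, 7]


def clone_pins(row, copies, offset):
    """Place `copies` shifted copies of a pin row side by side.

    Net k in copy c becomes net k + c * offset; empty positions (0) stay 0.
    """
    return [pin + c * offset if pin else 0
            for c in range(copies)
            for pin in row]


upper = clone_pins(BASE_UPPER, copies=4, offset=34)
```

The resulting `upper` list has 140 columns and reproduces the "Upper" row given above.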

In Section 6.2, I applied STAGE to eight additional channel routing instances,
HYC1 through HYC8. The pin columns for these problems may be found in [Chao
and  Harper  96]  and  are  also  available  on  the  STAGE  web  page. 


C.3 Bayes Network Learning

For  the  experiments  with  the  Bayes-net  structure-finding  domain  (Sections  4.4  and  6.1.5), 
three  datasets  were  used:  MPG,  ADULT2,  and  SYNTH125K. 

•  The  MPG  dataset  contains  information  on  the  horsepower,  weight,  gas  mileage, 
and  other  such  data  (10  total  attributes)  for  392  automobiles.  It  is  derived  from 
the  “Auto-Mpg”  dataset  available  from  the  UCI  Machine  Learning  repository 
[Merz  and  Murphy  98],  but  modified  by  coarsely  discretizing  all  continuous 
variables. 

•  The  ADULT2  dataset  was  also  obtained  from  the  UCI  repository.  It  consists  of 
census  data  related  to  the  job,  wealth,  nationality,  etc.  (15  total  attributes)  on 
30,162  individuals. 

• The SYNTH125K dataset was generated synthetically from the probability distribution
given by the Bayes net in Figure 4.7 (p. 86), designed by Moore and
Lee [98].

All three datasets are available for downloading from the STAGE web page.


C.4  Radiotherapy  Treatment  Planning 

The  radiotherapy  treatment  planning  domain  of  Section  4.5  is  too  complex  to  explain 
in  full  detail  here.  Please  refer  to  the  web  page  for  more  information.  Here,  I  describe 
the  domain  in  enough  detail  to  illustrate  its  complexity — in  particular,  to  show  why 
we  must  resort  to  using  local  search  rather  than,  say,  linear  programming  to  solve 
it.  The  form  of  my  objective  function  was  based  on  discussions  with  domain  experts; 
however,  I  did  not  have  access  to  a  medically  accurate  implementation  of  the  dose 
calculations  and  penalty  functions.  Thus,  I  claim  only  that  the  optimization  problem 
solved  here  retains  most  of  the  overall  structure  and  geometry  of  tradeoffs  found  in 
the  true  medical  domain. 

I formulated the problem as follows. The treatment area is discretized into an
80 × 80 rectangular grid. The radiation dosage $\mathrm{dose}_p(x)$ at each pixel $p$ can then
be calculated from the current plan $x$ according to a known forward model. Pixels
within the area of the tumor $t$ have a target dose $\mathrm{targ}_t$ and incur a penalty based on
the ratio $r_p = \mathrm{dose}_p(x) / \mathrm{targ}_t$:

$$\mathrm{Penalty}(p) = \begin{cases} \exp(1/\max(r_p, 0.1)) & \text{if } r_p < 1 \text{ (underdose)} \\ r_p - 1 & \text{if } r_p \ge 1 \text{ (overdose)} \end{cases}$$


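Pixels within a sensitive structure $s$ have a maximum acceptable dose $\mathrm{accep}_s$ and
incur a penalty based on the ratio $r_p = \mathrm{dose}_p(x) / \mathrm{accep}_s$:

$$\mathrm{Penalty}(p) = \begin{cases} \cdots & \\ \exp(r_p) & \text{if } r_p \ge 1 \text{ (overdose)} \end{cases}$$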

These two penalty functions are plotted in Figure C.1. Finally, the overall objective
function  is  calculated  as  a  weighted  sum  of  all  the  per-pixel  penalties.  The  weights 
are  fixed  and  reflect  the  relative  importance  of  the  various  structures  being  targeted 
or  protected. 
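
A minimal sketch of the tumor-pixel penalty and the weighted sum is given below. It covers only the tumor penalty as stated above; the function names and the flat-list representation of per-pixel penalties and weights are assumptions made for illustration.

```python
import math


def tumor_penalty(dose, target):
    """Per-pixel penalty for a tumor pixel, based on r = dose / target."""
    r = dose / target
    if r < 1.0:
        # Underdose: grows steeply as r falls, capped by clamping r at 0.1.
        return math.exp(1.0 / max(r, 0.1))
    # Overdose: linear in the excess ratio.
    return r - 1.0


def weighted_objective(penalties, weights):
    """Overall objective: weighted sum of per-pixel penalties."""
    return sum(w * p for w, p in zip(weights, penalties))
```

Note that the underdose branch approaches e as r approaches 1 from below, while the overdose branch starts at 0, so the penalty is neither continuous nor convex, which is consistent with the remark that linear programming does not apply directly.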


Figure C.1. The per-pixel penalty for pixels within a tumor (left) or within a
sensitive structure (right). Note the logarithmic scale on the y-axis.


This  problem  formulation  is  nearly  suitable  for  a  linear  programming  solution:  the 
dose  incurred  at  each  pixel  is  a  linear  combination  of  the  beam  intensities,  and  the 
objective  function  is  a  linearly  weighted  sum  of  penalty  terms.  If  the  penalty  functions 
of Figure C.1 had been defined to be piecewise linear and convex, and the beam
intensities were allowed to vary continuously over [0, 1], then the optimal treatment
plan x* would be obtainable by linear programming. However, with 0/1 discrete beam
intensities  and  non-convex  penalty  functions,  as  assumed  in  the  experiments  of  this 
thesis,  we  must  resort  to  heuristic  search. 


C.5 Cartogram Design

The  cartogram  design  problem  was  introduced  in  Section  4.6.  This  thesis  investigated 
a  single  instance,  US49.  Instance  US49  is  defined  by  the  49  polygons  of  the  continental 
U.S.  map  (48  states  plus  the  District  of  Columbia)  and  their  respective  target  areas, 
which  are  based  on  the  1990-2000  electoral  vote  for  U.S.  President.  This  dataset  is 
available  from  the  STAGE  web  page. 

Local  search  moves  in  this  domain  consist  of  choosing  one  of  the  162  points  and 
perturbing  it  slightly.  These  points  are  chosen  randomly,  but  with  a  bias  toward 
those  points  that  contribute  most  to  the  current  map  error  function.  To  be  precise: 
first a state is chosen with probability proportional to its contribution to Obj(x), and
then one point on that state is chosen uniformly at random. The perturbation is then
generated by adding uniformly random numbers in [−1.5, 1.5] to each coordinate.
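
The move generator just described might be sketched as follows. The data layout is an assumption made for illustration: `points` is a shared list of (x, y) map vertices, `state_points` maps each state name to the indices of its vertices, and `contributions` gives each state's share of Obj(x). None of these names come from the original implementation.

```python
import random


def propose_move(points, state_points, contributions, rng, max_shift=1.5):
    """Pick one map point to perturb; returns (point_index, new_xy)."""
    names = list(state_points)
    weights = [contributions[name] for name in names]
    # Step 1: choose a state with probability proportional to its error share.
    state = rng.choices(names, weights=weights, k=1)[0]
    # Step 2: choose one of that state's points uniformly at random.
    idx = rng.choice(state_points[state])
    # Step 3: add uniform noise in [-max_shift, max_shift] to each coordinate.
    px, py = points[idx]
    new_xy = (px + rng.uniform(-max_shift, max_shift),
              py + rng.uniform(-max_shift, max_shift))
    return idx, new_xy
```

Because vertices are shared between neighboring polygons in this layout, moving one point changes the shape of every state that uses it.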

The  objective  function  is  defined  as  the  sum  of  four  penalty  terms: 

$$\mathrm{Obj}(x) = \Delta\mathrm{Area}(x) + \Delta\mathrm{Gape}(x) + \Delta\mathrm{Orient}(x) + \Delta\mathrm{Segfrac}(x)$$


The  penalty  terms  are  defined  as  follows: 


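$$\Delta\mathrm{Area}(x) = \cdots \sum_{s \in \mathrm{States}} \Bigl( \max\Bigl( \frac{\mathrm{Area}_x(s)}{\mathrm{Area}_{\mathrm{targ}}(s)},\ \frac{\mathrm{Area}_{\mathrm{targ}}(s)}{\max(\mathrm{Area}_x(s),\ 0.001)} \Bigr) \cdots \Bigr)^2$$

$$\Delta\mathrm{Gape}(x) = \frac{\cdots}{\#\mathrm{Bends}} \sum_{\angle ABC \in \mathrm{Bends}} \bigl( \mathrm{measure}_x(\angle ABC) - \mathrm{measure}_{\mathrm{orig}}(\angle ABC) \bigr)^2$$

$$\Delta\mathrm{Orient}(x) = \frac{1}{\#\mathrm{Bends}} \sum_{\angle ABC \in \mathrm{Bends}} \bigl( \mathrm{measure}_x(\angle AB\_\_) - \mathrm{measure}_{\mathrm{orig}}(\angle AB\_\_) \bigr)^2$$

$$\Delta\mathrm{Segfrac}(x) = \frac{10}{\#\mathrm{Bends}} \sum_{\angle ABC \in \mathrm{Bends}} \Bigl( \frac{\mathrm{length}_x(AB)}{\mathrm{perim}_x(\mathrm{State}(\angle ABC))} - \frac{\mathrm{length}_{\mathrm{orig}}(AB)}{\mathrm{perim}_{\mathrm{orig}}(\mathrm{State}(\angle ABC))} \cdots \Bigr)^2$$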


The “Bends” above index each angle of each polygon in the map. The notation
“∠AB__” refers to the angle made between line AB and the x-axis (a fixed horizontal
line).  Finally,  the  constants  1  and  10  in  these  penalty  terms  were  chosen  by  trial  and 
error  to  create  aesthetically  appealing  cartograms. 


C.6  Boolean  Satisfiability 

The difficult “32-bit parity function learning” instances tested in Section 4.7 are
described in [Crawford 93]. The complete formulas are available for downloading
from the STAGE web page, and also from the DIMACS satisfiability archive:

ftp://dimacs.rutgers.edu/pub/challenge/sat/benchmarks/cnf/